<a href="https://colab.research.google.com/github/MikeHankinson/Amazon_Vine_Analysis/blob/main/Natural_Language_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Module 16 Natural Language Processing**

**16.5.5 NLP Process Pipeline:**
1.  Raw Text
2.  Tokenize
3.  Stop Words Filtering
4.  Term Frequency-Inverse Document Frequency Weight (TF-IDF)
5. Machine Learning (Run the Model)
6. Verify Model

---





**Install PySpark** 
PySpark is not preinstalled on Google Colab, so it has to be installed first.

---


In [None]:
import os
# Find the latest version of spark 3.0  from http://www-us.apache.org/dist/spark/ and enter as the spark version
# For example:
# spark_version = 'spark-3.0.1'
spark_version = 'spark-3.0.1'
os.environ['SPARK_VERSION']=spark_version

# Install Spark and Java
!apt-get update
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget -q http://www-us.apache.org/dist/spark/$SPARK_VERSION/$SPARK_VERSION-bin-hadoop2.7.tgz
!tar xf $SPARK_VERSION-bin-hadoop2.7.tgz
!pip install -q findspark

# Set Environment Variables (os is already imported above)
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = f"/content/{spark_version}-bin-hadoop2.7"

# Locate the Spark installation so that pyspark can be imported
import findspark
findspark.init()



In [None]:
#Shouldn't need to run this again.  Keep just in case.
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

16.5.3 **Tokenize** Sentence by Word and **Part of Speech Tagging**:  Natural Language Toolkit (**NLTK**)  

---



In [None]:
import nltk
from nltk import word_tokenize
text = word_tokenize("Misty enjoys walking on the trails")
output = nltk.pos_tag(text)
print(output)

[('Misty', 'NNP'), ('enjoys', 'VBZ'), ('walking', 'VBG'), ('on', 'IN'), ('the', 'DT'), ('trails', 'NNS')]


16.6.1 **Tokenize Data**: PySpark Machine Learning Library

---

In [None]:
# Start Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Tokens").getOrCreate()

In [None]:
# Import the Tokenizer Library
from pyspark.ml.feature import Tokenizer

In [None]:
# Create a Sample DataFrame
dataframe = spark.createDataFrame([(0, "Spark is Great"),
                                  (1, "We are learning Spark"),
                                  (2,"Spark is better than Hadoop, no doubt")],
                                 ["ID", "Sentence"]
                                 )
dataframe.show(truncate=False)

+---+-------------------------------------+
|ID |Sentence                             |
+---+-------------------------------------+
|0  |Spark is Great                       |
|1  |We are learning Spark                |
|2  |Spark is better than Hadoop, no doubt|
+---+-------------------------------------+



In [None]:
# The Tokenizer takes an input and an output parameter. 
# inputCol is the name of the column that we want to have tokenized. 
# outputCol is the name that we want the new column of tokens to have.

# Create a tokenizer for the sentences in our dataframe 
# (This only defines the transformer; no data is processed yet -- so, no output)
tokenizer = Tokenizer(inputCol="Sentence", outputCol="words")
tokenizer

Tokenizer_e545f98955ab

In [None]:
# Transform and Show Dataframe
# The created tokenizer has a transform method that takes a DataFrame as input.
# (the tokenizer behaves much like the split() method)
tokenized_df = tokenizer.transform(dataframe)
tokenized_df.show(truncate=False)


# for later on...
sentenceData = tokenized_df

+---+-------------------------------------+---------------------------------------------+
|ID |Sentence                             |words                                        |
+---+-------------------------------------+---------------------------------------------+
|0  |Spark is Great                       |[spark, is, great]                           |
|1  |We are learning Spark                |[we, are, learning, spark]                   |
|2  |Spark is better than Hadoop, no doubt|[spark, is, better, than, hadoop,, no, doubt]|
+---+-------------------------------------+---------------------------------------------+
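
The resemblance to `split()` noted above can be checked in plain Python. This is a minimal sketch, assuming that the `Tokenizer` lowercases the text and splits on whitespace, which is what the output table suggests. `simple_tokenize` is a hypothetical helper, not part of PySpark.

```python
def simple_tokenize(sentence):
    """Lowercase a sentence and split it on whitespace (assumed Tokenizer behavior)."""
    return sentence.lower().split()

sentences = [
    "Spark is Great",
    "We are learning Spark",
    "Spark is better than Hadoop, no doubt",
]

# One list of tokens per sentence, mirroring the "words" column
tokens = [simple_tokenize(s) for s in sentences]
for words in tokens:
    print(words)
```

Punctuation is not stripped, so "hadoop," keeps its trailing comma, just as in the table.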



In [None]:
# Not understanding this next section within 16.6.1 
# User-defined functions (UDFs) are functions created by the user to add 
# custom output columns. 

# Example below creates a function that enhances the tokenizer by
# returning a word count for each line.  

# Create a function to return the length of a list
def word_list_length(word_list):
    return len(word_list)

# next, import  
#   1. the udf function, 
#   2. the col function to select a column to be passed into a function, and
#   3. the type IntegerType that will be used in our udf to define the data type of the output 
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

# Create a user defined function
count_tokens = udf(word_list_length, IntegerType())

# Now redo the tokenizer process
# this time, after the DataFrame has outputted the tokenized values,
# use our own created function to return the number of tokens created.

# Create a Tokenizer
tokenizer = Tokenizer(inputCol="Sentence", outputCol="words")

# Transform the dataframe
tokenized_df = tokenizer.transform(dataframe)

# Select the needed columns and don't truncate the results
tokenized_df.withColumn("tokens", count_tokens(col("words"))).show(truncate=False)
 

+---+-------------------------------------+---------------------------------------------+------+
|ID |Sentence                             |words                                        |tokens|
+---+-------------------------------------+---------------------------------------------+------+
|0  |Spark is Great                       |[spark, is, great]                           |3     |
|1  |We are learning Spark                |[we, are, learning, spark]                   |4     |
|2  |Spark is better than Hadoop, no doubt|[spark, is, better, than, hadoop,, no, doubt]|7     |
+---+-------------------------------------+---------------------------------------------+------+



16.6.2 **Stop Words**: Words with little linguistic value in natural language processing (a, and, the, ...)

---

In [None]:
# Import stop words library
from pyspark.ml.feature import StopWordsRemover

# Run the Remover
remover = StopWordsRemover(inputCol="words", outputCol="filtered")


# Transform and Show Data / Use the tokenized dataframe from above
remover.transform(tokenized_df).show(truncate=False)

+---+-------------------------------------+---------------------------------------------+-------------------------------+
|ID |Sentence                             |words                                        |filtered                       |
+---+-------------------------------------+---------------------------------------------+-------------------------------+
|0  |Spark is Great                       |[spark, is, great]                           |[spark, great]                 |
|1  |We are learning Spark                |[we, are, learning, spark]                   |[learning, spark]              |
|2  |Spark is better than Hadoop, no doubt|[spark, is, better, than, hadoop,, no, doubt]|[spark, better, hadoop,, doubt]|
+---+-------------------------------------+---------------------------------------------+-------------------------------+
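
The filtering step can be illustrated without Spark. The sketch below uses a short, hypothetical stop word set chosen only to cover the sample sentences; the list that `StopWordsRemover` uses by default is much longer. `remove_stop_words` is an illustrative helper, not a PySpark function.

```python
# Hypothetical, abbreviated stop word list for illustration only
STOP_WORDS = {"is", "we", "are", "than", "no", "a", "and", "the"}

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word set."""
    return [w for w in words if w not in stop_words]

rows = [
    ["spark", "is", "great"],
    ["we", "are", "learning", "spark"],
    ["spark", "is", "better", "than", "hadoop,", "no", "doubt"],
]

# One filtered list per row, mirroring the "filtered" column
filtered = [remove_stop_words(r) for r in rows]
for words in filtered:
    print(words)
```

Because the match is on whole tokens, "hadoop," survives with its comma attached; cleaning punctuation would have to happen before this step.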



16.6.3 **Term Frequency - Inverse Document Frequency Weight (TF-IDF)**

---
1. **Term frequency (TF)** measures how often a word occurs in a document =>

TF = Number of times the word is used in the article / Number of words in the article

2. **Inverse document frequency (IDF)** measures the significance of a word across a set of documents =>

IDF = log(total number of articles / number of articles that contain the word)

3. TF-IDF = TF * IDF
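
The three formulas above can be worked through by hand on a tiny corpus. This is a plain-Python sketch of those textbook formulas, not of Spark's implementation: as far as I know, Spark's `HashingTF` keeps raw counts and its `IDF` uses a smoothed logarithm, so the numbers it produces will differ. The helper names and the three sample documents are made up for illustration, and every queried word is assumed to occur in at least one document.

```python
import math

def term_frequency(word, document):
    """TF: occurrences of the word divided by the number of words in the document."""
    return document.count(word) / len(document)

def inverse_document_frequency(word, documents):
    """IDF: log of (number of documents / number of documents containing the word)."""
    containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / containing)

def tf_idf(word, document, documents):
    """TF-IDF: the product of the two measures above."""
    return term_frequency(word, document) * inverse_document_frequency(word, documents)

docs = [
    ["spark", "great"],
    ["learning", "spark"],
    ["spark", "better", "hadoop", "doubt"],
]

score_spark = tf_idf("spark", docs[0], docs)   # word found in every document
score_great = tf_idf("great", docs[0], docs)   # word found in one document only
print(score_spark, score_great)
```

"spark" appears in all three documents, so its IDF is log(1) = 0 and its weight vanishes, while "great" appears in only one document and keeps a positive weight.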

**Note:** All the text first needs to be converted to a numerical format with

**HashingTF** -- converts words to numeric IDs. The same word is always assigned the same ID; each ID is mapped to an index and counted, and a vector of the counts is returned. 
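
The idea behind HashingTF can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: md5 is used here only as a stable stand-in hash (Spark uses a different hash function, so the indices will not match its output), and `bucket_index` and `hashing_tf` are hypothetical helper names.

```python
import hashlib
from collections import Counter

def bucket_index(word, num_features):
    """Map a word to a bucket index with a stable hash (md5 is a stand-in)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_features

def hashing_tf(words, num_features=2 ** 18):
    """Return a sparse {index: count} mapping for one document."""
    return dict(Counter(bucket_index(w, num_features) for w in words))

# "spark" occurs twice, so its bucket gets a count of at least 2
vector = hashing_tf(["spark", "is", "great", "spark"])
print(vector)
```

Only the buckets that were hit are stored, which is why the Spark output is shown as a sparse vector: a size, a list of indices, and a list of counts. Two different words can land in the same bucket; a larger number of features makes such collisions less likely.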


In [None]:
# Import Libraries
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, StopWordsRemover

In [None]:
# 1. Raw Data
# -------------------------

# Read in data from S3 Buckets
from pyspark import SparkFiles
url ="https://s3.amazonaws.com/dataviz-curriculum/day_2/airlines.csv"
spark.sparkContext.addFile(url)
df = spark.read.csv(SparkFiles.get("airlines.csv"), sep=",", header=True)

# Show DataFrame
df.show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------+
|Airline Tweets                                                                                                                         |
+---------------------------------------------------------------------------------------------------------------------------------------+
|@VirginAmerica plus you've added commercials to the experience... tacky.                                                               |
|@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing. it's really the only bad thing about flying VA|
|@VirginAmerica do you miss me? Don't worry we'll be together very soon.                                                                |
|@VirginAmerica Are the hours of operation for the Club at SFO that are posted online current?                                          |
|@VirginAmerica awaiting my return

In [None]:
# 2. Tokenize Dataframe
# -------------------------

# Tokenize DataFrame
tokened = Tokenizer(inputCol="Airline Tweets", outputCol="words")
tokened_transformed = tokened.transform(df)
tokened_transformed.show()

+--------------------+--------------------+
|      Airline Tweets|               words|
+--------------------+--------------------+
|@VirginAmerica pl...|[@virginamerica, ...|
|@VirginAmerica se...|[@virginamerica, ...|
|@VirginAmerica do...|[@virginamerica, ...|
|@VirginAmerica Ar...|[@virginamerica, ...|
|@VirginAmerica aw...|[@virginamerica, ...|
+--------------------+--------------------+



In [None]:
# 3. Remove Stop Words
# -------------------------
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
removed_frame = remover.transform(tokened_transformed)
removed_frame.show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------+
|Airline Tweets                                                                                                                         |words                                                                                                                                                          |filtered                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------

In [None]:
# 4. Term Frequency - Inverse Document Frequency Weight (TF-IDF)
# -------------------------

# a. The HashingTF function takes an argument for an input column, an output column,
# and a numFeatures parameter, which specifies the number of buckets for the split words.

# Run the hashing term frequency 
hashing = HashingTF(inputCol="filtered", outputCol="hashedValues",numFeatures=pow(2,18))

# Transform into a DF
hashed_df= hashing.transform(removed_frame)
hashed_df.show()

#--------------------------------------------------------------------------------------
# ---NOTE BELOW: hashedValues column shows the INDEX for each unique word and its FREQUENCY---
#                With the words successfully converted to numbers, plug it all into an IDFModel, 
#                which will scale the values while down-weighting based on document frequency. 
#--------------------------------------------------------------------------------------

+--------------------+--------------------+--------------------+--------------------+
|      Airline Tweets|               words|            filtered|        hashedValues|
+--------------------+--------------------+--------------------+--------------------+
|@VirginAmerica pl...|[@virginamerica, ...|[@virginamerica, ...|(262144,[1419,999...|
|@VirginAmerica se...|[@virginamerica, ...|[@virginamerica, ...|(262144,[30053,44...|
|@VirginAmerica do...|[@virginamerica, ...|[@virginamerica, ...|(262144,[107065,1...|
|@VirginAmerica Ar...|[@virginamerica, ...|[@virginamerica, ...|(262144,[9641,506...|
|@VirginAmerica aw...|[@virginamerica, ...|[@virginamerica, ...|(262144,[6122,505...|
+--------------------+--------------------+--------------------+--------------------+



In [None]:
# b. Run the IDF Model

# Fit the IDF on the data set
idf = IDF(inputCol="hashedValues", outputCol = "features")
idfModel = idf.fit(hashed_df)
rescaledData = idfModel.transform(hashed_df)

# Display the dataframe
rescaledData.select("words", "features").show(truncate=False)


+---------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|words                                                                                                                                                          |features                                                                                                                                                                                                                                                                                                        |
+-----------------------------------------------------------------

16.6.4 and 16.6.5 **Machine Learning / Run the Model**

---
1. 16.6.4 **Pipeline Setup**
2. 16.6.5 **Run the Model**


***Note: Why aren't we using the same data throughout this process? Once again, another data set is imported below.***

In [None]:
# 1. 16.6.4 Pipeline Setup
# +++++++++++++++++++++++++

# Read in data from S3 Buckets
from pyspark import SparkFiles
url ="https://s3.amazonaws.com/dataviz-curriculum/day_2/yelp_reviews.csv"
spark.sparkContext.addFile(url)
df = spark.read.csv(SparkFiles.get("yelp_reviews.csv"), sep=",", header=True)

# Show DataFrame
df.show()

+--------+--------------------+
|   class|                text|
+--------+--------------------+
|positive|Wow... Loved this...|
|negative|  Crust is not good.|
|negative|Not tasty and the...|
|positive|Stopped by during...|
|positive|The selection on ...|
|negative|Now I am getting ...|
|negative|Honeslty it didn'...|
|negative|The potatoes were...|
|positive|The fries were gr...|
|positive|      A great touch.|
|positive|Service was very ...|
|negative|  Would not go back.|
|negative|The cashier had n...|
|positive|I tried the Cape ...|
|negative|I was disgusted b...|
|negative|I was shocked bec...|
|positive| Highly recommended.|
|negative|Waitress was a li...|
|negative|This place is not...|
|negative|did not like at all.|
+--------+--------------------+
only showing top 20 rows



In [None]:
# Import functions
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF, StringIndexer

In [None]:
# Create a new column that uses the length function to create a future feature 
# holding the length (in characters) of each row's text. This is similar to the tokenizer phase, in which 
# we created a udf to count the tokens. 
# A udf could be used here, but PySpark makes it easier by supplying a ready-to-use function.


from pyspark.sql.functions import length
# Create a length column to be used as a future feature
data_df = df.withColumn('length', length(df['text']))
data_df.show()

+--------+--------------------+------+
|   class|                text|length|
+--------+--------------------+------+
|positive|Wow... Loved this...|    24|
|negative|  Crust is not good.|    18|
|negative|Not tasty and the...|    41|
|positive|Stopped by during...|    87|
|positive|The selection on ...|    59|
|negative|Now I am getting ...|    46|
|negative|Honeslty it didn'...|    37|
|negative|The potatoes were...|   111|
|positive|The fries were gr...|    25|
|positive|      A great touch.|    14|
|positive|Service was very ...|    24|
|negative|  Would not go back.|    18|
|negative|The cashier had n...|    99|
|positive|I tried the Cape ...|    59|
|negative|I was disgusted b...|    62|
|negative|I was shocked bec...|    50|
|positive| Highly recommended.|    19|
|negative|Waitress was a li...|    38|
|negative|This place is not...|    51|
|negative|did not like at all.|    20|
+--------+--------------------+------+
only showing top 20 rows



In [None]:
# Tokenize, Stop Words Filter, TF and IDF
# -------------------------

# NOTE: the StringIndexer encodes a string column of labels to a column of label indices

# Here we are working with positive and negative Yelp reviews, 
# which will be converted to the numbers 0 and 1. This will form our labels, 
# which we'll delve into in the ML unit. 
# The label is what we're trying to predict: 
# will the review's given text let us know if it was positive or negative?

pos_neg_to_num = StringIndexer(inputCol='class',outputCol='label')
tokenizer = Tokenizer(inputCol="text", outputCol="token_text")
stopremove = StopWordsRemover(inputCol='token_text',outputCol='stop_tokens')
hashingTF = HashingTF(inputCol="stop_tokens", outputCol='hash_token')
idf = IDF(inputCol='hash_token', outputCol='idf_token')

In [None]:
# Create a feature vector containing the output from the IDFModel 
#(the last stage in the pipeline) and the length. 
# This combines all the raw features to train the ML model that we'll be using.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vector
# Create feature vectors
clean_up = VectorAssembler(inputCols=['idf_token', 'length'], outputCol='features')
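
What the assembler does to the vector size can be shown with a small plain-Python sketch. It assumes a sparse vector written as `(size, {index: value})` and that the extra column is appended after the existing slots, in the order given in `inputCols`. `assemble` is a hypothetical helper and the sample values are made up.

```python
def assemble(sparse_vector, extra_value):
    """Append one numeric value to a sparse vector given as (size, {index: value})."""
    size, entries = sparse_vector
    combined = dict(entries)
    combined[size] = float(extra_value)  # the new slot sits right after the old ones
    return size + 1, combined

# A made-up TF-IDF vector with 2**18 = 262144 hashing buckets
idf_token = (2 ** 18, {20298: 1.1, 177414: 2.3})

# Append the review length (24 characters) as one more feature
features = assemble(idf_token, 24)
print(features[0])
```

This is consistent with the pipeline output, where `features` has size 262145: the 262144 hashing buckets plus one slot for `length`.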

In [None]:
# Create and run a data processing Pipeline
# import the pipeline from pyspark.ml, and then store a list of the stages 
# created earlier. It's important to list the stages in the order they need to be executed. 
# REMEMBER the output from one stage will then be passed off to another stage.


from pyspark.ml import Pipeline
data_prep_pipeline = Pipeline(stages=[pos_neg_to_num, tokenizer, stopremove, hashingTF, idf, clean_up])

In [None]:
# 2. 16.6.5 Run the Model
# +++++++++++++++++++++++++

# Fit and transform the pipeline
cleaner = data_prep_pipeline.fit(data_df)
cleaned = cleaner.transform(data_df)


# Show label and resulting features
cleaned.select(['label', 'features']).show()


# NOTE: The labels and features that were created early in the process are 
#      numerical representations of positive and negative reviews. 
#      The features will be used in the model to predict whether a given review 
#      will be positive or negative. 

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  1.0|(262145,[177414,2...|
|  0.0|(262145,[49815,23...|
|  0.0|(262145,[109649,1...|
|  1.0|(262145,[53101,68...|
|  1.0|(262145,[15370,77...|
|  0.0|(262145,[98142,13...|
|  0.0|(262145,[59172,22...|
|  0.0|(262145,[63420,85...|
|  1.0|(262145,[53777,17...|
|  1.0|(262145,[221827,2...|
|  1.0|(262145,[43756,22...|
|  0.0|(262145,[127310,1...|
|  0.0|(262145,[407,3153...|
|  1.0|(262145,[18098,93...|
|  0.0|(262145,[23071,12...|
|  0.0|(262145,[129941,1...|
|  1.0|(262145,[19633,21...|
|  0.0|(262145,[27707,65...|
|  0.0|(262145,[20891,27...|
|  0.0|(262145,[8287,208...|
+-----+--------------------+
only showing top 20 rows



In [None]:
# Run the model on the data
# a. 70% Training Data
# b. 30% Testing Data
# c. seed number = 21, arbitrary but ensures reproducible results
# d. Use Naive Bayes Classifier  

# Break data down into a training set and a testing set
training, testing = cleaned.randomSplit([0.7, 0.3], 21)

from pyspark.ml.classification import NaiveBayes
# Create a Naive Bayes model and fit training data
nb = NaiveBayes()
predictor = nb.fit(training)

# Apply the fitted model to the testing data
test_results = predictor.transform(testing)
test_results.show(5)


# *** Prediction Column: 0 = negative review
#                        1 = positive review



+--------+--------------------+------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|   class|                text|length|label|          token_text|         stop_tokens|          hash_token|           idf_token|            features|       rawPrediction|         probability|prediction|
+--------+--------------------+------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|negative|"The burger... I ...|    86|  0.0|["the, burger...,...|["the, burger...,...|(262144,[20298,21...|(262144,[20298,21...|(262145,[20298,21...|[-820.60780566975...|[0.99999999999995...|       0.0|
|negative|              #NAME?|     6|  0.0|            [#name?]|            [#name?]|(262144,[197050],...|(262144,[197050],...|(262145,[197050,2...|[-73.489435340867...|[0.07515735596910.
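
The 70/30 split with a fixed seed can be sketched in plain Python. This is a minimal illustration, assuming that each row is assigned to a side independently at random, which is my understanding of how `randomSplit` behaves; the resulting sizes are therefore only approximately 70% and 30%. `random_split` is a hypothetical helper and the rows are stand-in integers.

```python
import random

def random_split(rows, train_fraction=0.7, seed=21):
    """Assign each row to training or testing with a seeded random draw."""
    rng = random.Random(seed)
    training, testing = [], []
    for row in rows:
        if rng.random() < train_fraction:
            training.append(row)
        else:
            testing.append(row)
    return training, testing

rows = list(range(1000))          # stand-in for 1000 reviews
training, testing = random_split(rows)
print(len(training), len(testing))
```

Calling `random_split(rows)` again with the same seed returns the same two lists, which is the point of fixing the seed: the model is always trained and tested on the same rows.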

16.6.5 **Verify the Model**

---
1. Import the **BinaryClassificationEvaluator**...uses two arguments to score the model

  a. **labelCol**: takes the labels, which were the result of using StringIndexer to convert our positive and negative strings to numbers. 

  b. **rawPredictionCol**: takes in numerical predictions from the output of running the Naive Bayes model

  Model performance can be measured based on the difference between its predicted values and actual values. 

  (discuss model accuracy, precision and sensitivity later)


In [None]:
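from pyspark.ml.evaluation import BinaryClassificationEvaluator
# No metricName is given, so the evaluator uses its default metric (areaUnderROC);
# the value printed below as "accuracy" is therefore an area under the ROC curve.
acc_eval = BinaryClassificationEvaluator(labelCol='label', rawPredictionCol='prediction')
acc = acc_eval.evaluate(test_results)
print("Accuracy of model at predicting reviews was: %f" % acc)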


Accuracy of model at predicting reviews was: 0.700298
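
The difference between plain accuracy and the area under the ROC curve can be seen on a toy example. The sketch below assumes the evaluator's default metric (`areaUnderROC`) is in effect, since no `metricName` is passed. It computes the area as the chance that a positive example is scored above a negative one, with ties counted as half. The labels and predictions are made up, and both functions are hypothetical helpers.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that equal the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def area_under_roc(labels, scores):
    """AUC as the chance that a positive outranks a negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1.0]
    negatives = [s for y, s in zip(labels, scores) if y == 0.0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels      = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
predictions = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]

acc = accuracy(labels, predictions)
auc = area_under_roc(labels, predictions)
print(acc, auc)
```

With hard 0/1 predictions the area reduces to the average of the true positive rate and the true negative rate, here (2/3 + 4/5) / 2, which is not the same number as the 6/8 accuracy. If plain accuracy is wanted, `MulticlassClassificationEvaluator` with `metricName="accuracy"` is one option to look at.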
