<a href="https://colab.research.google.com/github/Muzznah/Module-16/blob/master/TokenizeData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [21]:
# Install Java, Spark, and Findspark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://www-us.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
!tar xf spark-3.0.0-bin-hadoop2.7.tgz
!pip install -q findspark

# Set Environment Variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop2.7"

# Locate the Spark installation and make PySpark importable
import findspark
findspark.init()

In [4]:
# Start Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Tokens").getOrCreate()

In [5]:
# Import the Tokenizer class
from pyspark.ml.feature import Tokenizer

Spark also lets us create a DataFrame from scratch. Although you’ll mainly work with DataFrames loaded from existing data, small hand-built DataFrames make for quick, easy testing. The following code creates a small DataFrame that shows the data before tokenization:

In [13]:
# Create a sample DataFrame.
dataframe = spark.createDataFrame(
    [(0, 'Spark is great'), (1, 'We are learning Spark'), (2, 'Spark is better than Hadoop no doubt')],
    ['id', 'sentences'],
)
dataframe.show()

+---+--------------------+
| id|           sentences|
+---+--------------------+
|  0|      Spark is great|
|  1|We are learning S...|
|  2|Spark is better t...|
+---+--------------------+



In [14]:
# Create the tokenizer: it reads the "sentences" column and writes to "words"
tokenizer = Tokenizer(inputCol="sentences", outputCol="words")
tokenizer

Tokenizer_832152c63dd2

In [15]:
# Transform and show the DataFrame.
tokenized_df = tokenizer.transform(dataframe)
tokenized_df.show(truncate=False)

+---+------------------------------------+--------------------------------------------+
|id |sentences                           |words                                       |
+---+------------------------------------+--------------------------------------------+
|0  |Spark is great                      |[spark, is, great]                          |
|1  |We are learning Spark               |[we, are, learning, spark]                  |
|2  |Spark is better than Hadoop no doubt|[spark, is, better, than, hadoop, no, doubt]|
+---+------------------------------------+--------------------------------------------+
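
As the `words` column shows, Tokenizer lowercases each sentence and splits it on whitespace. The plain-Python sketch below mimics that behavior for a single string, purely as an illustration. `simple_tokenize` is a hypothetical helper, not part of PySpark, and edge cases (such as trailing whitespace) may differ from Spark's implementation.

```python
import re


def simple_tokenize(sentence):
    """Hypothetical stand-in for Tokenizer on one string: lowercase, then split on whitespace."""
    return re.split(r"\s", sentence.lower())


# Try it on two of the sample sentences from the DataFrame above.
for sentence in ["Spark is great", "We are learning Spark"]:
    print(simple_tokenize(sentence))
```

This only illustrates the per-row logic. On real data, the Spark `Tokenizer` remains the right tool because it runs distributed across the DataFrame.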



In [16]:
# Create a function to return the length of a list
def word_list_length(word_list):
    return len(word_list)

In [17]:
# Import dependencies.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

In [19]:
# Wrap the function as a user-defined function (UDF) that returns an integer
count_tokens = udf(word_list_length, IntegerType())

In [20]:
# Create the tokenizer
tokenizer = Tokenizer(inputCol="sentences", outputCol="words")

# Transform the DataFrame.
tokenized_df = tokenizer.transform(dataframe)

# Add a "tokens" column holding the word count, and don't truncate the results.
tokenized_df.withColumn("tokens", count_tokens(col('words'))).show(truncate=False)

+---+------------------------------------+--------------------------------------------+------+
|id |sentences                           |words                                       |tokens|
+---+------------------------------------+--------------------------------------------+------+
|0  |Spark is great                      |[spark, is, great]                          |3     |
|1  |We are learning Spark               |[we, are, learning, spark]                  |4     |
|2  |Spark is better than Hadoop no doubt|[spark, is, better, than, hadoop, no, doubt]|7     |
+---+------------------------------------+--------------------------------------------+------+



# **SKILL DRILL**
***Combine Tokenizer and StopWordsRemover on a DataFrame that isn’t already broken out into a list of words.***

In [22]:
tokenized_df.show()

+---+--------------------+--------------------+
| id|           sentences|               words|
+---+--------------------+--------------------+
|  0|      Spark is great|  [spark, is, great]|
|  1|We are learning S...|[we, are, learnin...|
|  2|Spark is better t...|[spark, is, bette...|
+---+--------------------+--------------------+



In [23]:
# Import the StopWordsRemover class
from pyspark.ml.feature import StopWordsRemover

In [24]:
# Create the remover: it reads the "words" column and writes to "filtered"
remover = StopWordsRemover(inputCol="words", outputCol="filtered")


In [25]:
# Transform and show data.
remover.transform(tokenized_df).show(truncate=False)

+---+------------------------------------+--------------------------------------------+------------------------------+
|id |sentences                           |words                                       |filtered                      |
+---+------------------------------------+--------------------------------------------+------------------------------+
|0  |Spark is great                      |[spark, is, great]                          |[spark, great]                |
|1  |We are learning Spark               |[we, are, learning, spark]                  |[learning, spark]             |
|2  |Spark is better than Hadoop no doubt|[spark, is, better, than, hadoop, no, doubt]|[spark, better, hadoop, doubt]|
+---+------------------------------------+--------------------------------------------+------------------------------+
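
For intuition about what the two stages do together, the plain-Python sketch below tokenizes a raw sentence and then drops stop words in one step. It is an illustration under stated assumptions: `tokenize_and_filter` is a hypothetical helper, and `ILLUSTRATIVE_STOP_WORDS` is a tiny hand-picked set covering only the words removed in the output above, not the default list that StopWordsRemover uses.

```python
# Hypothetical, tiny stop-word set for illustration only (NOT Spark's default list).
ILLUSTRATIVE_STOP_WORDS = {"is", "we", "are", "than", "no"}


def tokenize_and_filter(sentence, stop_words=ILLUSTRATIVE_STOP_WORDS):
    """Lowercase and split a raw sentence, then drop any word found in stop_words."""
    words = sentence.lower().split()
    return [word for word in words if word not in stop_words]


# Apply both steps to the raw sample sentences.
for sentence in [
    "Spark is great",
    "We are learning Spark",
    "Spark is better than Hadoop no doubt",
]:
    print(tokenize_and_filter(sentence))
```

Lowercasing before the lookup means capitalized stop words such as "We" are also removed, which matches the `filtered` column shown above.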

