
Natural Language Processing

In [44]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('nlp').getOrCreate()
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType


Tokenizer and RegexTokenizer are classes for splitting text into individual words, or tokens. Tokenizer lowercases the text and splits it on whitespace, while RegexTokenizer uses a regular expression to define how the text should be split. Both are key steps in an NLP pipeline, preparing data for further analysis or model training.

**col** references a column in a DataFrame.
**udf** creates a user-defined function. Spark supports user-defined functions written in Python, Scala, or Java; here it wraps a Python function so that it can be applied to DataFrame columns. It is useful for calculations that PySpark's built-in functions do not support.
**IntegerType** is a PySpark SQL data type representing a 32-bit signed integer. It is useful wherever the expected data type of a column must be specified, such as the return type of a udf.
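
To make the difference between the two tokenizers concrete, here is a rough plain-Python analogue. It is only a sketch and not Spark's implementation: the function names `whitespace_tokenize` and `regex_tokenize` are made up for illustration, and Python's `str.split()` collapses runs of whitespace, which may differ from Spark on text with repeated spaces.

```python
import re

def whitespace_tokenize(text):
    """Lowercase the text and split it on runs of whitespace."""
    return text.lower().split()

def regex_tokenize(text, pattern=r"\W"):
    """Lowercase the text, split wherever `pattern` matches, and drop empty tokens."""
    return [tok for tok in re.split(pattern, text.lower()) if tok]

sentence = "Eight months into the PhD!"
print(whitespace_tokenize(sentence))  # ['eight', 'months', 'into', 'the', 'phd!']
print(regex_tokenize(sentence))       # ['eight', 'months', 'into', 'the', 'phd']
```

The whitespace version keeps the trailing `!` attached to the last word, while splitting on non-word characters removes it.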


# DataFrame for NLP

In [45]:
spark = SparkSession.builder.appName('nlp').getOrCreate()
testSents = spark.createDataFrame([
    (0, 'Eight months into the PhD!'),
    (1, 'AI 2 Module is almost complete and Christmas is near'),
    (2, 'Lets see what these tokens are all about'),
    (3, 'The weather is doing alright!!!')
],['id','sentence']) # defines the column names for the DataFrame
testSents.show(truncate=False) # show() truncates long strings by default; truncate=False displays the full sentences
testSents.printSchema() # print the column names and data types

+---+----------------------------------------------------+
|id |sentence                                            |
+---+----------------------------------------------------+
|0  |Eight months into the PhD!                          |
|1  |AI 2 Module is almost complete and Christmas is near|
|2  |Lets see what these tokens are all about            |
|3  |The weather is doing alright!!!                     |
+---+----------------------------------------------------+

root
 |-- id: long (nullable = true)
 |-- sentence: string (nullable = true)



# Tokenization

In [46]:
tokens = Tokenizer(inputCol='sentence', outputCol='words')
# udf that returns the number of tokens in a row's 'words' array
countTokens = udf(lambda words: len(words), IntegerType())
tokenized = tokens.transform(testSents)
tokenized.withColumn('tokens', countTokens(col('words'))).show(truncate=False)

+---+----------------------------------------------------+---------------------------------------------------------------+------+
|id |sentence                                            |words                                                          |tokens|
+---+----------------------------------------------------+---------------------------------------------------------------+------+
|0  |Eight months into the PhD!                          |[eight, months, into, the, phd!]                               |5     |
|1  |AI 2 Module is almost complete and Christmas is near|[ai, 2, module, is, almost, complete, and, christmas, is, near]|10    |
|2  |Lets see what these tokens are all about            |[lets, see, what, these, tokens, are, all, about]              |8     |
|3  |The weather is doing alright!!!                     |[the, weather, is, doing, alright!!!]                          |5     |
+---+----------------------------------------------------+---------------------------------------------------------------+------+

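**RegexTokenizer allows more advanced tokenization based on regular expression
(regex) matching. By default, the parameter “pattern” (regex, default: "\\s+") is used as
a delimiter to split the input text on whitespace. The next cell sets the pattern to "\\W", which splits on any non-word character and therefore also strips punctuation.**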
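
RegexTokenizer also has a `gaps` parameter (true by default): when it is false, the pattern describes the tokens themselves rather than the gaps between them. The plain-Python sketch below imitates both modes. It is an illustration under stated assumptions, not Spark's code: `regex_tokenize_sketch` is a hypothetical helper, and it assumes simple patterns without capturing groups.

```python
import re

def regex_tokenize_sketch(text, pattern=r"\s+", gaps=True, min_token_length=1):
    """Split on `pattern` when gaps=True; collect matches of `pattern` when gaps=False."""
    text = text.lower()
    if gaps:
        candidates = re.split(pattern, text)    # pattern marks the separators
    else:
        candidates = re.findall(pattern, text)  # pattern marks the tokens
    # Drop empty strings and anything shorter than the minimum token length.
    return [tok for tok in candidates if len(tok) >= min_token_length]

sentence = "The weather is doing alright!!!"
print(regex_tokenize_sketch(sentence))                      # ['the', 'weather', 'is', 'doing', 'alright!!!']
print(regex_tokenize_sketch(sentence, r"\W"))               # ['the', 'weather', 'is', 'doing', 'alright']
print(regex_tokenize_sketch(sentence, r"\w+", gaps=False))  # ['the', 'weather', 'is', 'doing', 'alright']
```

Splitting on `\W` and matching `\w+` give the same tokens here; which form is easier to write depends on whether the separators or the tokens are simpler to describe.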

In [47]:
regexTokens2 = RegexTokenizer(inputCol='sentence', outputCol='words', pattern='\\W')
regexTokenized = regexTokens2.transform(testSents)
regexTokenized.select(col("sentence"), col("words")).withColumn("tokens", countTokens(col("words"))).show(truncate=False)

+----------------------------------------------------+---------------------------------------------------------------+------+
|sentence                                            |words                                                          |tokens|
+----------------------------------------------------+---------------------------------------------------------------+------+
|Eight months into the PhD!                          |[eight, months, into, the, phd]                                |5     |
|AI 2 Module is almost complete and Christmas is near|[ai, 2, module, is, almost, complete, and, christmas, is, near]|10    |
|Lets see what these tokens are all about            |[lets, see, what, these, tokens, are, all, about]              |8     |
|The weather is doing alright!!!                     |[the, weather, is, doing, alright]                             |5     |
+----------------------------------------------------+---------------------------------------------------------------+------+

TF-IDF (term frequency-inverse document frequency) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus.
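
As a worked illustration of the weighting, the sketch below computes TF-IDF by hand on two short token lists. It assumes the smoothed form described in the Spark ML documentation, IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)), multiplied by the raw term count. The helper `tf_idf` is hypothetical and works on exact terms rather than hashed buckets, so its numbers will not match the Spark output.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using a smoothed IDF."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_freq = Counter(doc)  # raw count of each term in this document
        weights.append({
            term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in term_freq.items()
        })
    return weights

docs = [
    ["eight", "months", "into", "the", "phd!"],
    ["the", "weather", "is", "doing", "alright!!!"],
]
weights = tf_idf(docs)
print(weights[0]["the"])    # 0.0, because "the" occurs in every document
print(weights[0]["eight"])  # log(3/2), because "eight" occurs in one of two documents
```

A term that appears in every document gets a weight of zero, which is how TF-IDF plays down words that do not help tell documents apart.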

In [48]:
from pyspark.ml.feature import HashingTF, IDF
# Hash each token into one of 20 buckets and count occurrences (raw term frequencies)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(tokenized)
# Fit IDF on the corpus, then rescale the raw term frequencies
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
rescaledData.select("id", "features").show(truncate=True)

+---+--------------------+
| id|            features|
+---+--------------------+
|  0|(20,[0,2,6,11,17]...|
|  1|(20,[6,8,9,11,12,...|
|  2|(20,[3,6,11,12,14...|
|  3|(20,[8,9,13,15,17...|
+---+--------------------+
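
Each features entry above is a sparse vector of size 20 whose indices are hash buckets, not individual words. The sketch below shows the idea of the hashing trick in plain Python. It is only an approximation: `hashed_term_counts` is a hypothetical helper, and `zlib.crc32` is a stand-in hash, whereas Spark's HashingTF uses MurmurHash3, so the bucket indices will differ from the output above.

```python
import zlib
from collections import Counter

NUM_FEATURES = 20  # same bucket count as numFeatures in the cell above

def hashed_term_counts(tokens, num_features=NUM_FEATURES):
    """Map each token to a bucket index and count the tokens that land in each bucket."""
    counts = Counter()
    for tok in tokens:
        # Stand-in hash for illustration; Spark uses MurmurHash3 instead of crc32.
        index = zlib.crc32(tok.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(sorted(counts.items()))

tokens = ["ai", "2", "module", "is", "almost", "complete", "and", "christmas", "is", "near"]
vector = hashed_term_counts(tokens)
print(vector)  # {bucket index: count}; different words may share a bucket
```

With only 20 buckets, collisions between different words are likely, which is the trade-off for a fixed-size vector that needs no vocabulary.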



**StopWordsRemover:**
Stop words are words which should be excluded from the input, typically because the
words appear frequently and don’t carry as much meaning.
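
The filtering itself is simple, as this plain-Python sketch shows. The stop-word set here is a hypothetical, abbreviated list chosen for illustration; StopWordsRemover ships with a much longer default English list, and by default it matches case-insensitively.

```python
# Hypothetical, abbreviated stop-word list for illustration only.
STOP_WORDS = {"the", "is", "and", "into", "what", "these", "are", "all", "about"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep the tokens that are not in the stop-word set (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

print(remove_stop_words(["eight", "months", "into", "the", "phd!"]))
# ['eight', 'months', 'phd!']
```

Note that `phd!` survives with its punctuation because the input came from the whitespace tokenizer; a token with attached punctuation would also never match a stop word.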

In [49]:
from pyspark.ml.feature import StopWordsRemover
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
remover.transform(tokenized).show(truncate=False)

+---+----------------------------------------------------+---------------------------------------------------------------+--------------------------------------------------+
|id |sentence                                            |words                                                          |filtered                                          |
+---+----------------------------------------------------+---------------------------------------------------------------+--------------------------------------------------+
|0  |Eight months into the PhD!                          |[eight, months, into, the, phd!]                               |[eight, months, phd!]                             |
|1  |AI 2 Module is almost complete and Christmas is near|[ai, 2, module, is, almost, complete, and, christmas, is, near]|[ai, 2, module, almost, complete, christmas, near]|
|2  |Lets see what these tokens are all about            |[lets, see, what, these, tokens, are, all, about]              |[lets, s