### AST 5: ETL pipeline for Text Mining and Analytics

At the end of the experiment, you will be able to:

* perform text mining and analytics using Spark SQL functions
* use Spark’s built-in and external data sources to write data in different file formats as part of the extract, transform, and load (ETL) tasks


## Information

The basic terms used in text analytics are:

* **Text**: a sequence of words and punctuation
* **Corpus**: a large body of text
* **Frequency distribution**: the frequency of words in a text object
* **Collocation**: a sequence of words that occur together unusually often
* **Bigrams**: word pairs. High-frequency bigrams are candidate collocations
* **Text normalization**: the process of transforming text into a single canonical form, e.g., converting text to lowercase and removing punctuation and stop words.
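
The terms above can be tried out on a tiny example before bringing in Spark. The following is a minimal sketch in plain Python; the two-sentence corpus and the helper names `normalize` and `bigrams` are made up for illustration and are not part of the notebook.

```python
from collections import Counter
import string

# Hypothetical two-sentence corpus, used only for illustration.
corpus = [
    "The quick brown fox jumps.",
    "The quick brown dog sleeps!",
]

def normalize(text):
    """Lowercase the text and strip ASCII punctuation (a simple normalization)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def bigrams(tokens):
    """Return adjacent word pairs."""
    return list(zip(tokens, tokens[1:]))

# Frequency distribution of bigrams over the whole corpus.
bigram_counts = Counter()
for sentence in corpus:
    tokens = normalize(sentence).split()
    bigram_counts.update(bigrams(tokens))

# Bigrams seen more than once are collocation candidates.
candidates = [pair for pair, n in bigram_counts.items() if n > 1]
print(candidates)
```

On a real corpus, a raw count threshold is only a rough filter; it is used here to keep the sketch short.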

### Introduction

Text analytics is the process of deriving information from text. It usually involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging, information extraction, visualization, and predictive analytics. The overarching goal is to turn text into data for analysis by applying natural language processing (NLP), various algorithms, and analytical methods.

Here we will use the `milton-paradise.txt` text file from the Gutenberg corpus for text mining and analytics. Starting from data extraction, we will apply several transformations to the text, including tokenization, word counting, POS tagging, and chunking, and then store the result in different file formats.

### Setup Steps:

### Install Pyspark

In [None]:
!pip install pyspark



### Import required packages

In [None]:
from pyspark.sql import SparkSession
from matplotlib import pyplot as plt
import pandas as pd
import string
from nltk import Tree
from pyspark.ml.feature import NGram
from pyspark.ml import Pipeline
from pyspark.sql.types import *
from pyspark.sql.functions import *
import nltk

### Start a Spark Session

A Spark session is the unified entry point of a Spark application, introduced in Spark 2.0. Instead of working with several separate contexts, everything is now encapsulated in the Spark session.

In [None]:
# Start spark session
spark = SparkSession.builder.appName('ETL text data').getOrCreate()
spark

### Text Analytics

#### Get the text data

The raw text comes from the Gutenberg corpus in the nltk package. First, list the file ids in the Gutenberg corpus.

In [None]:
nltk.download('gutenberg')

# Download dependencies for sent_tokenize()
nltk.download('punkt_tab')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
from nltk.corpus import gutenberg
gutenberg_fileids = gutenberg.fileids()
gutenberg_fileids

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

The file id is `milton-paradise.txt`. Use the nltk.sent_tokenize() function to split text into sentences.

In [None]:
milton_paradise = gutenberg.raw('milton-paradise.txt')
#print(milton_paradise)
pdf = pd.DataFrame({'sentences': nltk.sent_tokenize(milton_paradise)})
d = spark.createDataFrame(pdf)
d.show(1, truncate= False)

[Output truncated: a one-row table with the single column `sentences`]

The output above shows that the sentences contain extra whitespace, such as line breaks and runs of spaces.

#### Transform Data

* Remove line breaks, repeated spaces, and leading/trailing spaces

In [None]:
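# Transform data
# Note: deleting newlines outright can join the last word of one line to the first word of the next.
# d_x is shown only for comparison and is not used in later steps.
d_x = d.withColumn("sentences", regexp_replace(col("sentences"), "\\n+",""))
d_x.show(5, truncate= False)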

In [None]:
# Transform data
d1 = d.withColumn("sentences", regexp_replace(col("sentences"), "\\s+"," "))       # replace each run of whitespace with a single space
d1 = d1.withColumn("sentences", trim(col("sentences")))                            # remove leading and trailing spaces ["   Spark", "Spark  ", " Spark"]
d1.show(5, truncate= False)
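
The difference between the two regular expressions used above can be checked outside Spark. This is a small sketch with Python's `re` module, assuming a made-up sample string; `re.sub` and `str.strip` stand in for Spark's `regexp_replace` and `trim` and only approximate them.

```python
import re

# Hypothetical sample with a line break between two words.
raw = "  Of Man's first\ndisobedience,   and the fruit  "

# Deleting newlines outright removes the only separator between the lines.
no_newlines = re.sub(r"\n+", "", raw)

# Collapsing every whitespace run to one space, then trimming both ends.
collapsed = re.sub(r"\s+", " ", raw).strip()

print(repr(no_newlines))
print(repr(collapsed))
```

The first result glues `first` and `disobedience` together, which is why the `\s+` version is the one carried forward as `d1`.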

In [None]:
# Transform data
# d1 = d.withColumn("sentences", regexp_replace(col("sentences"), "\\s+","_"))       # replace all spaces with underscore
# d1 = d1.withColumn("sentences", regexp_replace(col("sentences"), "_"," "))         # replace all underscores with one space
# d1 = d1.withColumn("sentences", trim(col("sentences")))                            # remove trailing spaces

In [None]:
d1.show(5, truncate= False)

[Output truncated: first five rows of the single column `sentences`]

In [None]:
# Check for empty lines
d1.where(col("sentences")=="").count()

##### Word Tokenization

Word tokenization is the process of breaking a paragraph, a sentence, or an entire text corpus into an array of words.
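
To make the idea concrete, here is a rough sketch of a tokenizer built from a single regular expression. It is a simplified stand-in, not NLTK's algorithm: `word_tokenize` handles contractions, abbreviations, and other cases differently. The function name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(sentence):
    """Split into word runs (keeping internal apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

print(simple_tokenize("Sing, Heavenly Muse!"))
```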

In [None]:
from nltk.tokenize import word_tokenize

word_udf = udf(lambda x: word_tokenize(x), ArrayType(StringType()))
d2 = d1.withColumn("words", word_udf("sentences"))

In [None]:
d2.show(5)

The output above shows that the tokens still include punctuation.

* **Remove punctuation and stopwords**

In [None]:
# Download stopwords
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)

In [None]:

punctuation = string.punctuation
print(punctuation)

In [None]:
# Transform data
# Declare the return type; without it the UDF result defaults to StringType
punct_udf = udf(
    lambda x: [w for w in x if w not in punctuation and w.lower() not in stop_words],
    ArrayType(StringType()))
d3 = d2.withColumn("words", punct_udf("words"))
d3.show(5)
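
The filtering logic of this step can be exercised on a plain Python list. The sketch below assumes a small made-up stop word set and token list; the notebook itself uses NLTK's English stop word list.

```python
import string

# Small hypothetical stop word set; the notebook uses NLTK's English list.
stop_words = {"of", "and", "the", "that"}

# Hypothetical tokens, similar in shape to tokenizer output.
tokens = ["Of", "Man", "'s", "first", "disobedience", ",", "and", "the", "fruit", "--"]

kept = [
    w for w in tokens
    if w not in string.punctuation      # substring test against the punctuation string
    and w.lower() not in stop_words     # case-insensitive stop word test
]
print(kept)
```

Because `string.punctuation` is a single string, the `in` test is a substring check: single marks such as `,` are dropped, while multi-character tokens such as `--` or `'s` are kept unless they happen to appear as a contiguous run in that string.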

In [None]:
# Make sure the column is typed as array<string> for further processing

array_udf = udf(lambda x: x, ArrayType(StringType()))
d4 = d3.withColumn("words", array_udf("words"))
d4.show(5)

##### Ngrams and collocations

A collocation is a sequence of words that occur together unusually often.
Bigrams are word pairs; high-frequency bigrams are candidate collocations.

Let's see how to transform the texts into 2-grams, 3-grams, and 4-grams.

In [None]:
ngrams = [NGram(n=n, inputCol='words', outputCol=str(n)+'-grams') for n in [2,3,4]]

# build pipeline model
pipeline = Pipeline(stages=ngrams)

# transform data
texts_ngrams = pipeline.fit(d4).transform(d4)

In [None]:
# display result
texts_ngrams.select('2-grams').show(6, truncate=False)
texts_ngrams.select('3-grams').show(6, truncate=False)
texts_ngrams.select('4-grams').show(6, truncate=False)
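
What the `NGram` stages compute for one row can be sketched in plain Python. The helper `ngrams` below is hypothetical; it mirrors the transformer's behavior of returning space-joined strings and an empty list when a row has fewer than `n` tokens.

```python
def ngrams(tokens, n):
    """Return space-joined n-grams; empty when there are fewer than n tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical token list standing in for one row of the `words` column.
words = ["Man", "first", "disobedience", "fruit"]
for n in (2, 3, 4, 5):
    print(n, ngrams(words, n))
```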

* Add the number of words column

In [None]:
# Transform data
len_udf = udf(lambda x: len(x), IntegerType())

d5 = d4.withColumn("no_of_words", len_udf("words"))

In [None]:
d5.show(5)

##### **POS (part-of-speech) tagging**

POS tagging converts a sentence, given as a list of words, into a list of tuples, where each tuple has the form (word, tag). The tag is a part-of-speech tag and signifies whether the word is a noun, adjective, verb, and so on.

To know more about POS tagging click [here](https://cdn.exec.talentsprint.com/static/cds/content/M7_AST5_pos.pdf).

In [None]:
# Download dependencies for pos_tag()
nltk.download('averaged_perceptron_tagger_eng')

In [None]:
## define schema for returned result from the udf function
## the returned result is a list of tuples
schema = ArrayType(StructType([
            StructField('f1', StringType()),
            StructField('f2', StringType())    ]))

sent_to_tag_words_udf = udf(lambda x: nltk.pos_tag(x), schema)

In [None]:
# Transform data
d6 = d5.withColumn("tagged_words", sent_to_tag_words_udf("words"))
d6.show(5)
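
The shape of the tagged data can be pictured with a small plain-Python sketch. The tags below are hand-assigned for illustration only (real tags come from `nltk.pos_tag`), and the dictionaries merely mimic how each (word, tag) tuple maps onto the struct fields `f1` and `f2` declared in the schema.

```python
from collections import Counter

# Hand-assigned Penn Treebank style tags, for illustration only.
tagged_words = [("Man", "NN"), ("first", "JJ"), ("disobedience", "NN"), ("fruit", "NN")]

# Name the tuple positions the way the struct schema does (f1 = word, f2 = tag).
as_structs = [{"f1": word, "f2": tag} for word, tag in tagged_words]

# Count how often each tag occurs in the sentence.
tag_counts = Counter(item["f2"] for item in as_structs)
print(tag_counts)
```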

##### **Frequency Distribution Plot**

A frequency distribution records how many times each word occurs. Here it is plotted for the words of a single sentence, the first row of the data frame.

In [None]:
from nltk.probability import FreqDist

row = d6.select('words').toPandas().iloc[0,0]
fd = FreqDist(row)
fd.plot(30, cumulative= False)
plt.show()

From the above plot it can be seen that in the first row, the word 'Man' has occurred twice.
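
NLTK's `FreqDist` is a subclass of `collections.Counter`, so the counting behind the plot can be reproduced without plotting. The token list below is made up for illustration.

```python
from collections import Counter

# Hypothetical token list standing in for one row of the `words` column.
row = ["Man", "first", "disobedience", "fruit", "Man", "man"]

fd = Counter(row)
print(fd.most_common(1))
print(fd["Man"], fd["man"])
```

Counting is case-sensitive, so `Man` and `man` are separate entries unless the tokens are lowercased first.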

##### **Chunking**
Chunking is the process of grouping adjacent words into phrases based on their part-of-speech tags, that is, segmenting and labeling multi-token sequences. Let's see how to do noun phrase chunking on the tagged words data frame from the previous step.

First we need to define a UDF that chunks noun phrases from a list of POS-tagged words.

In [None]:
# define a udf function to chunk noun phrases from pos-tagged words
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = nltk.RegexpParser(grammar)
chunk_parser_udf = udf(lambda x: str(chunk_parser.parse(x)), StringType())
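
To see what the grammar `NP: {<DT>?<JJ>*<NN>}` selects, the pattern can be imitated with an ordinary regular expression over the tag sequence. This is only a sketch of the idea, not NLTK's implementation, and the tagged tokens are hand-written for illustration.

```python
import re

# Hand-tagged tokens for illustration only; real tags come from nltk.pos_tag.
tagged = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("cat", "NN")]

# Encode the tag sequence as one string, e.g. "<DT><JJ><NN><VBD>...".
tag_string = "".join(f"<{tag}>" for _, tag in tagged)

# Same shape as the grammar: optional DT, any number of JJ, then one NN.
np_pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

chunks = []
for m in np_pattern.finditer(tag_string):
    # Convert character offsets back to token indices by counting "<".
    start = tag_string[:m.start()].count("<")
    end = start + m.group().count("<")
    chunks.append([word for word, _ in tagged[start:end]])

print(chunks)
```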

In [None]:
# Transform data
d7 = d6.withColumn("NP_chunk", chunk_parser_udf("tagged_words"))

In [None]:
d7.select('NP_chunk').show(1, truncate= False)

In [None]:
# Function to pretty-print chunks
def pretty_print_tree(chunk):
    try:
        tree = Tree.fromstring(chunk)  # Parse the chunk as an NLTK tree
        return tree.pformat(margin=200)  # Pretty print with a margin
    except Exception as e:
        return str(e)

# UDF to apply pretty-printing
pretty_print_udf = udf(pretty_print_tree, StringType())

# Apply UDF to create a new column with pretty-printed trees
d7_pretty = d7.withColumn("Pretty_Tree", pretty_print_udf(d7["NP_chunk"]))

row = d7_pretty.first()  # fetch only the first row instead of collecting the whole data frame
print(row['Pretty_Tree'])

#### Load data

**Use Parquet file to store data**

In [None]:
d7.write.format("parquet").mode("overwrite").save("transformed_text_parquet_data")

**Read data from Parquet file**

In [None]:
df_text_parquet = spark.read.format("parquet").load("transformed_text_parquet_data")

In [None]:
df_text_parquet.show(5)

**Store the data as a `json` file**

In [None]:
d7.write.format("json").mode("overwrite").save('transformed_text_json_data.json')

**Read data from `json` to spark dataframe**

In [None]:
df_text_json = spark.read.format("json").load('transformed_text_json_data.json')

In [None]:
df_text_json.show(5)
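
Note that Spark's `json` format writes line-delimited JSON (one object per line), and that the path passed to `save` becomes a directory of part files rather than a single file. The sketch below imitates that layout with the standard library only; the records, the directory name, and the part file name `part-00000.json` are hypothetical and simpler than the names Spark generates.

```python
import json
import os
import tempfile

# Hypothetical records shaped like two columns of the transformed data.
records = [
    {"sentences": "Of Man's first disobedience", "no_of_words": 3},
    {"sentences": "Sing, Heavenly Muse", "no_of_words": 3},
]

# Mimic the layout: the "file" path is a directory holding part files.
out_dir = os.path.join(tempfile.mkdtemp(), "example_json_data.json")
os.makedirs(out_dir)
part_path = os.path.join(out_dir, "part-00000.json")

with open(part_path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")   # one JSON object per line

# Read the part file back, parsing each line as its own JSON document.
with open(part_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(os.path.isdir(out_dir), loaded == records)
```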