## Part 2 Datasets/DataFrames: Spark ML and Pipelines
Convert the review texts to a classic vector space representation with TF-IDF-weighted features based on the Spark DataFrame/Dataset API by building a transformation pipeline. The primary goal of this part is the preparation of the pipeline for Part 3 (see below). Note: although parts of this pipeline will be very similar to Assignment 1 or Part 1 above, do not expect to obtain identical results or have access to all intermediate outputs to compare the individual steps.

Use built-in functions for tokenization to unigrams at whitespaces, tabs, digits, and the delimiter characters ()[]{}.!?,;:+=-_"'`~#@&*%€$§\/, casefolding, stopword removal, TF-IDF calculation, and chi-square selection (using the 2000 top terms overall). Write the terms selected this way to a file output_ds.txt and compare them with the terms selected in Assignment 1. Describe your observations briefly in the submission report (see Part 3).

[Provided link for ML pipeline](https://spark.apache.org/docs/latest/ml-pipeline.html)  
[Provided link for feature extraction](https://spark.apache.org/docs/latest/ml-features.html)

In [2]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

from pyspark.ml import Pipeline
from pyspark.ml.feature import (
    ChiSqSelector,
    CountVectorizer,
    IDF,
    RegexTokenizer,
    StopWordsRemover,
    StringIndexer,
    UnivariateFeatureSelector,
)



In [3]:
# Initialize Spark context and session
conf = SparkConf().setAppName("Part2")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
24/05/17 14:51:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/05/17 14:51:48 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/05/17 14:51:48 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
24/05/17 14:51:48 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
24/05/17 14:51:48 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
24/05/17 14:51:4

In [4]:
spark

### Importing data
Load the stopword list and the review development set.

In [5]:
# Path to the stopwords file (read from the local file system)
stopwords_file = "stopwords.txt"

# Load stopwords into a set
with open(stopwords_file, "r") as f:
    stopwords = set(f.read().strip().split())

# Load the Amazon reviews development set from HDFS
input_file = "hdfs:///user/dic24_shared/amazon-reviews/full/reviews_devset.json"
reviews_df = spark.read.json(input_file)

                                                                                

In [6]:
reviews_df = reviews_df.select("category", "reviewText")

# Show the DataFrame with selected columns
reviews_df.show()

+--------------------+--------------------+
|            category|          reviewText|
+--------------------+--------------------+
|Patio_Lawn_and_Garde|This was a gift f...|
|Patio_Lawn_and_Garde|This is a very ni...|
|Patio_Lawn_and_Garde|The metal base wi...|
|Patio_Lawn_and_Garde|For the most part...|
|Patio_Lawn_and_Garde|This hose is supp...|
|Patio_Lawn_and_Garde|This tool works v...|
|Patio_Lawn_and_Garde|This product is a...|
|Patio_Lawn_and_Garde|I was excited to ...|
|Patio_Lawn_and_Garde|I purchased the L...|
|Patio_Lawn_and_Garde|Never used a manu...|
|Patio_Lawn_and_Garde|Good price. Good ...|
|Patio_Lawn_and_Garde|I have owned the ...|
|Patio_Lawn_and_Garde|I had "won" a sim...|
|Patio_Lawn_and_Garde|The birds ate all...|
|Patio_Lawn_and_Garde|Bought last summe...|
|Patio_Lawn_and_Garde|I knew I had a mo...|
|Patio_Lawn_and_Garde|I was a little wo...|
|Patio_Lawn_and_Garde|I have used this ...|
|Patio_Lawn_and_Garde|I actually do not...|
|Patio_Lawn_and_Garde|Just what 

In [6]:
# Show the first two rows of reviews_df
reviews_df.take(2)

[Row(category='Patio_Lawn_and_Garde', reviewText="This was a gift for my other husband.  He's making us things from it all the time and we love the food.  Directions are simple, easy to read and interpret, and fun to make.  We all love different kinds of cuisine and Raichlen provides recipes from everywhere along the barbecue trail as he calls it. Get it and just open a page.  Have at it.  You'll love the food and it has provided us with an insight into the culture that produced it. It's all about broadening horizons.  Yum!!"),
 Row(category='Patio_Lawn_and_Garde', reviewText='This is a very nice spreader.  It feels very solid and the pneumatic tires give it great maneuverability and handling over bumps.  The control arm is solid metal, not a cable, which gives you precise control and will last a long time.  The settings take some experimentation with your various products to get it right, but that is true of any spreader.  It has good distribution... probably flings material a little 

In [8]:
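# Split on whitespace (including tabs), digits, and the delimiter characters from the task description
pattern = r'\s+|\d+|[(){}\[\].!?,;:+=_"\'`~#@&*%€$§\\/\-]'
# toLowercase=True is the default; it is spelled out here because it performs the casefolding step
regexTokenizer = RegexTokenizer(inputCol="reviewText", outputCol="words", pattern=pattern, toLowercase=True)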
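
The split pattern can be checked without a cluster. The sketch below approximates the tokenizer in plain Python, assuming that Python's `re` module treats this particular pattern the same way as the Java regex engine used by Spark. The helper name `tokenize` and the sample sentence are made up for illustration.

```python
import re

# Same pattern as passed to RegexTokenizer above.
PATTERN = r'\s+|\d+|[(){}\[\].!?,;:+=_"\'`~#@&*%€$§\\/\-]'

def tokenize(text, min_token_length=1):
    """Approximate RegexTokenizer(gaps=True, toLowercase=True) in plain Python."""
    pieces = re.split(PATTERN, text.lower())
    # Adjacent delimiters produce empty strings; drop them, as minTokenLength=1 does.
    return [p for p in pieces if len(p) >= min_token_length]

print(tokenize("He's making 2 things (fast)!"))
# ['he', 's', 'making', 'things', 'fast']
```

Note that the apostrophe is a delimiter, so contractions such as "He's" are split into two tokens, which matches the `'he', 's'` pair visible in the notebook output.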

In [8]:
reviews_df = regexTokenizer.transform(reviews_df)

In [9]:
reviews_df.columns

['category', 'reviewText', 'words']

In [10]:
reviews_df.take(1)

                                                                                

[Row(category='Patio_Lawn_and_Garde', reviewText="This was a gift for my other husband.  He's making us things from it all the time and we love the food.  Directions are simple, easy to read and interpret, and fun to make.  We all love different kinds of cuisine and Raichlen provides recipes from everywhere along the barbecue trail as he calls it. Get it and just open a page.  Have at it.  You'll love the food and it has provided us with an insight into the culture that produced it. It's all about broadening horizons.  Yum!!", words=['this', 'was', 'a', 'gift', 'for', 'my', 'other', 'husband', 'he', 's', 'making', 'us', 'things', 'from', 'it', 'all', 'the', 'time', 'and', 'we', 'love', 'the', 'food', 'directions', 'are', 'simple', 'easy', 'to', 'read', 'and', 'interpret', 'and', 'fun', 'to', 'make', 'we', 'all', 'love', 'different', 'kinds', 'of', 'cuisine', 'and', 'raichlen', 'provides', 'recipes', 'from', 'everywhere', 'along', 'the', 'barbecue', 'trail', 'as', 'he', 'calls', 'it', '

In [11]:
type(stopwords)

set

In [9]:
remover = StopWordsRemover(stopWords=list(stopwords), inputCol="words", outputCol="filtered_words")


In [None]:
reviews_df = remover.transform(reviews_df)

Compute raw term counts (term frequencies) for each review.

In [10]:
# HashingTF was considered, but it keeps no vocabulary, so selected feature
# indices could not be mapped back to terms. CountVectorizer keeps one.
countV = CountVectorizer(inputCol="filtered_words", outputCol="rawFeatures")


In [None]:
# Fit the vocabulary, then turn each review into a sparse term-count vector
# (see https://spark.apache.org/docs/latest/ml-features.html#tf-idf)
countModel = countV.fit(reviews_df)
reviews_df = countModel.transform(reviews_df)

In [28]:
countModel.vocabulary[:10]

['great',
 'good',
 'love',
 'time',
 'work',
 'recommend',
 'back',
 'easy',
 'make',
 'bought']
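
The vocabulary above is ordered by corpus-wide term frequency, which is why very common words such as "great" come first. A minimal plain-Python sketch of the same idea follows. The toy documents are invented, and the alphabetical tie-break is an assumption made only to keep the example deterministic; Spark's ordering of equally frequent terms may differ.

```python
from collections import Counter

# Toy corpus of already tokenized and filtered reviews (invented data).
docs = [
    ["great", "hose", "great", "price"],
    ["good", "price"],
    ["great", "tool"],
]

# Corpus-wide counts decide the vocabulary order: most frequent term gets index 0.
corpus_counts = Counter(term for doc in docs for term in doc)
vocabulary = [
    term
    for term, _ in sorted(corpus_counts.items(), key=lambda kv: (-kv[1], kv[0]))
]
term_to_index = {term: i for i, term in enumerate(vocabulary)}

def count_vector(doc):
    """Sparse term-count vector as {vocabulary index: count}."""
    return dict(sorted(Counter(term_to_index[t] for t in doc).items()))

print(vocabulary)
print(count_vector(docs[0]))
```

Because every index corresponds to exactly one vocabulary entry, indices chosen by a later feature selector can be translated back into terms.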

In [15]:
reviews_df.take(1)

24/05/17 14:44:42 WARN DAGScheduler: Broadcasting large task binary with size 1057.6 KiB
                                                                                

[Row(category='Patio_Lawn_and_Garde', reviewText="This was a gift for my other husband.  He's making us things from it all the time and we love the food.  Directions are simple, easy to read and interpret, and fun to make.  We all love different kinds of cuisine and Raichlen provides recipes from everywhere along the barbecue trail as he calls it. Get it and just open a page.  Have at it.  You'll love the food and it has provided us with an insight into the culture that produced it. It's all about broadening horizons.  Yum!!", words=['this', 'was', 'a', 'gift', 'for', 'my', 'other', 'husband', 'he', 's', 'making', 'us', 'things', 'from', 'it', 'all', 'the', 'time', 'and', 'we', 'love', 'the', 'food', 'directions', 'are', 'simple', 'easy', 'to', 'read', 'and', 'interpret', 'and', 'fun', 'to', 'make', 'we', 'all', 'love', 'different', 'kinds', 'of', 'cuisine', 'and', 'raichlen', 'provides', 'recipes', 'from', 'everywhere', 'along', 'the', 'barbecue', 'trail', 'as', 'he', 'calls', 'it', '

In [11]:
idf = IDF(inputCol="rawFeatures", outputCol="features")


In [None]:
idfModel = idf.fit(reviews_df)
reviews_df = idfModel.transform(reviews_df)
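
For reference, the Spark ML documentation defines the inverse document frequency with add-one smoothing as $\ln\frac{m + 1}{df(t) + 1}$, where $m$ is the number of documents and $df(t)$ the number of documents containing term $t$. The sketch below reproduces that weighting in plain Python; the function names and the toy numbers are illustrative only.

```python
import math

def spark_idf(num_docs, doc_freq):
    """Smoothed IDF as documented for Spark ML: ln((m + 1) / (df + 1))."""
    return math.log((num_docs + 1) / (doc_freq + 1))

def tfidf(term_counts, doc_freqs, num_docs):
    """Scale the raw counts of one document by the IDF of each term."""
    return {t: c * spark_idf(num_docs, doc_freqs[t]) for t, c in term_counts.items()}

# Toy corpus statistics: 3 documents, "great" occurs in all of them, "hose" in one.
doc_freqs = {"great": 3, "hose": 1}
weights = tfidf({"great": 2, "hose": 1}, doc_freqs, num_docs=3)
print(weights)
```

A term that appears in every document receives a weight of zero under this formula, regardless of how often it occurs in a single review.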

In [17]:
reviews_df.take(1)

24/05/17 14:44:50 WARN DAGScheduler: Broadcasting large task binary with size 2.5 MiB


[Row(category='Patio_Lawn_and_Garde', reviewText="This was a gift for my other husband.  He's making us things from it all the time and we love the food.  Directions are simple, easy to read and interpret, and fun to make.  We all love different kinds of cuisine and Raichlen provides recipes from everywhere along the barbecue trail as he calls it. Get it and just open a page.  Have at it.  You'll love the food and it has provided us with an insight into the culture that produced it. It's all about broadening horizons.  Yum!!", words=['this', 'was', 'a', 'gift', 'for', 'my', 'other', 'husband', 'he', 's', 'making', 'us', 'things', 'from', 'it', 'all', 'the', 'time', 'and', 'we', 'love', 'the', 'food', 'directions', 'are', 'simple', 'easy', 'to', 'read', 'and', 'interpret', 'and', 'fun', 'to', 'make', 'we', 'all', 'love', 'different', 'kinds', 'of', 'cuisine', 'and', 'raichlen', 'provides', 'recipes', 'from', 'everywhere', 'along', 'the', 'barbecue', 'trail', 'as', 'he', 'calls', 'it', '

The chi-square selector requires a numeric label column, so the category strings are first converted to numeric indices.

In [12]:
# Apply StringIndexer to convert categorical column to numerical
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")


In [None]:
reviews_df = indexer.fit(reviews_df).transform(reviews_df)
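
By default `StringIndexer` assigns indices in order of descending label frequency (`stringOrderType="frequencyDesc"`), so index 0.0 goes to the most common category. The plain-Python sketch below mimics that ordering on invented labels. The alphabetical tie-break is an assumption for determinism; the example avoids ties so it does not depend on it.

```python
from collections import Counter

def index_labels(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}

# Invented label column: three "Book" rows, two "Music" rows, one "Patio" row.
labels = ["Book", "Patio", "Music", "Book", "Music", "Book"]
print(index_labels(labels))
# {'Book': 0.0, 'Music': 1.0, 'Patio': 2.0}
```

Under this ordering, a high index simply indicates that a category has relatively few reviews in the dataset.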

In [19]:
reviews_df.select('categoryIndex').show()

+-------------+
|categoryIndex|
+-------------+
|         18.0|
|         18.0|
|         18.0|
|         18.0|
|         18.0|
|         18.0|
|         18.0|
|         18.0|
|         18.0|
|         18.0|
|         18.0|
|         18.0|
|         18.0|
|         18.0|
|         18.0|
|         18.0|
|         18.0|
|         18.0|
|         18.0|
|         18.0|
+-------------+
only showing top 20 rows



In [20]:
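# ChiSqSelector is deprecated since Spark 3.1 in favour of UnivariateFeatureSelector (used further below)
chiSelector = ChiSqSelector(
    numTopFeatures=2000,
    featuresCol="features",
    labelCol="categoryIndex",
    outputCol="selectedFeatures2",
)
model = chiSelector.fit(reviews_df)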


24/05/17 14:44:53 WARN DAGScheduler: Broadcasting large task binary with size 2.5 MiB
24/05/17 14:44:53 WARN DAGScheduler: Broadcasting large task binary with size 2.5 MiB
24/05/17 14:45:01 WARN DAGScheduler: Broadcasting large task binary with size 2.5 MiB
                                                                                

In [21]:
model.transform(reviews_df).select('selectedFeatures2').show(1, truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|selectedFeatures2                                                                                                                                                                                                                                                                                                                               |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

24/05/17 14:45:25 WARN DAGScheduler: Broadcasting large task binary with size 2.5 MiB


In [13]:
# Categorical features combined with a categorical label make the selector use a chi-squared test
selector = UnivariateFeatureSelector(
    featuresCol="features",
    outputCol="selectedFeatures",
    labelCol="categoryIndex",
    selectionMode="numTopFeatures",
)
selector.setFeatureType("categorical").setLabelType("categorical").setSelectionThreshold(2000)




UnivariateFeatureSelector_156a50f66587
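
Both selectors above rank features with a chi-squared test of independence between a feature and the category label. The sketch below computes the Pearson statistic for a simplified present/absent table in plain Python. This is an illustration under stated assumptions, not Spark's implementation: Spark builds the contingency table over every distinct feature value, so its numbers will differ. The counts are invented.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if term occurrence and category were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

# Rows: term present / term absent. Columns: garden reviews / all other reviews.
table = [[30, 10],
         [10, 50]]
print(round(chi_square(table), 4))
# 34.0278
```

A term whose occurrence is independent of the category yields a statistic of zero, so terms with large values are the ones most characteristic of particular categories.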

In [None]:
result = selector.fit(reviews_df)

In [23]:
# Indices (into the CountVectorizer vocabulary) of the features kept by the selector
selected_features = result.selectedFeatures
result.transform(reviews_df).select('selectedFeatures').show(1, truncate=False)

24/05/17 14:45:56 WARN DAGScheduler: Broadcasting large task binary with size 2.5 MiB


+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|selectedFeatures                                                                                                                                                                                                                                                                                                                                |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [14]:
pipeline = Pipeline(stages=[regexTokenizer, remover, countV, idf, indexer, selector])

In [15]:
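# Fit on the raw columns only: if the intermediate cells above were run, reviews_df
# already contains "words" etc., and the stages would fail on existing output columns
model = pipeline.fit(reviews_df.select("category", "reviewText"))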

24/05/17 14:54:44 WARN DAGScheduler: Broadcasting large task binary with size 1059.7 KiB
24/05/17 14:54:54 WARN DAGScheduler: Broadcasting large task binary with size 2.5 MiB
24/05/17 14:54:54 WARN DAGScheduler: Broadcasting large task binary with size 2.5 MiB
24/05/17 14:55:01 WARN DAGScheduler: Broadcasting large task binary with size 2.5 MiB
                                                                                

In [16]:
result = model.transform(reviews_df.select("category", "reviewText"))

In [21]:
result.selectedFeatures

Column<'selectedFeatures'>
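
At this point `result` is the transformed DataFrame, so `result.selectedFeatures` is only a column reference. The selected feature indices live on the fitted selector stage, and the matching terms on the fitted CountVectorizer stage. The sketch below shows one way to turn those into `output_ds.txt` and to compare them with the Assignment 1 terms. The helper names, the space-separated output format, and the overlap measure are assumptions, not requirements taken from the task.

```python
def selected_terms(vocabulary, selected_indices):
    """Translate selected feature indices into alphabetically sorted terms."""
    return sorted(vocabulary[i] for i in selected_indices)

def write_terms(terms, path="output_ds.txt"):
    """Write the terms on one line, separated by spaces (assumed format)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(" ".join(terms) + "\n")

def overlap(terms_a, terms_b):
    """Return (number of shared terms, Jaccard similarity) of two term lists."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    jaccard = len(a & b) / len(union) if union else 0.0
    return len(a & b), jaccard

# With the fitted pipeline from above, the inputs would come from the stages
# (positions follow the stage order used when building the pipeline):
#   vocabulary = model.stages[2].vocabulary            # CountVectorizerModel
#   selected_indices = model.stages[-1].selectedFeatures
#   write_terms(selected_terms(vocabulary, selected_indices))
print(selected_terms(["great", "good", "love"], [2, 0]))
# ['great', 'love']
```

The `overlap` helper gives a quick number for the report, for example how many of the 2000 terms also appear in the Assignment 1 selection.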