# Using Spark

This tutorial uses the Python API for Spark (PySpark). Spark itself is written in Scala. APIs in Java and R are also available, and are worth checking out if they better suit your preferences. R in particular has active development in both the SparkR and sparklyr packages. The advantage PySpark has over these APIs is the large number of modules and functionalities available, and its compatibility with many Python libraries, which allows flexible use of user-defined functions. If you want the best performance and control over your Spark program, however, learning Scala and writing your Spark code in Scala is highly recommended.

This tutorial is a piece of working PySpark code, which should give you the basic functionality and syntax. The underlying mechanisms of Spark and Hadoop will be covered in the meetings and the Data Engineering workshops.

In [1]:
# Sometimes your python can't find the pyspark package because of some path complications.
# A handy package in python called findspark makes sure this doesn't happen, and 
# can be used as shown below.
import findspark
findspark.init()

Make sure Hadoop is running in your environment. A handy way to check is the bash command jps, which should show something similar to the output of the next box. (In general, putting a ! at the beginning of a line in a Jupyter notebook runs that line as a shell command.)

In [2]:
!jps

8193 NameNode
8388 DataNode
8887 ResourceManager
8647 SecondaryNameNode
9241 NodeManager
23355 Jps
32510 launcher.jar


If the output above doesn't show the Hadoop daemons running, run the following code box, which will start the relevant daemons (assuming you have installed Hadoop and Spark correctly).

In [None]:
%%bash
start-all.sh
jps

Hopefully the jps output now shows the daemons running. If you see an error message, or any of the daemons (NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager) are still missing, make sure that you have installed Hadoop and Spark correctly.
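If you would rather check the daemons programmatically than by eye, the following is a minimal sketch in plain Python. The function name `missing_daemons` is hypothetical, and the sample text is simply the jps output shown earlier in this tutorial.

```python
# Daemons the tutorial expects to see in the jps output.
REQUIRED_DAEMONS = {
    "NameNode", "DataNode", "SecondaryNameNode",
    "ResourceManager", "NodeManager",
}

def missing_daemons(jps_output):
    """Return the set of required daemons that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # Each jps line looks like "<pid> <process name>"; skip anything else.
        if len(parts) == 2 and parts[0].isdigit():
            running.add(parts[1])
    return REQUIRED_DAEMONS - running

# Sample text copied from the jps output shown earlier in this tutorial.
sample = """8193 NameNode
8388 DataNode
8887 ResourceManager
8647 SecondaryNameNode
9241 NodeManager
23355 Jps
32510 launcher.jar"""

print(missing_daemons(sample))  # an empty set means every daemon is up
```

In a notebook, the text to check could come from something like `subprocess.run(["jps"], capture_output=True, text=True).stdout` instead of the hard-coded sample.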

Now that the prerequisites are done, let's dive into how to start an instance of Spark. We will instantiate a SparkSession and specify the master as YARN, which means that Spark will use YARN as its cluster manager to schedule resources. This is especially useful if you are working with the Hadoop filesystem as your backend (our main filesystem in CDS is a Hadoop file system).

__Note: please make sure that the number of executor instances (the configuration named "spark.executor.instances") does not exceed the number of cores on your computer minus 1. Change spark.executor.memory to 1g and spark.driver.memory to 2g to make sure your computer doesn't crash while running this notebook.__
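As a rough sketch of the note above, the settings for a laptop can be derived from the core count that Python reports. The helper name `local_spark_settings` is hypothetical. `os.cpu_count()` counts logical cores, which may be more than the number of physical cores, so treat the result as an upper bound.

```python
import os

def local_spark_settings(cores=None):
    """Suggest conservative settings for running this notebook on a laptop."""
    # os.cpu_count() can return None, so fall back to 2 cores in that case.
    cores = cores or os.cpu_count() or 2
    return {
        # leave one core free, as the note above recommends
        "spark.executor.instances": str(max(1, cores - 1)),
        "spark.executor.memory": "1g",
        "spark.driver.memory": "2g",
        "spark.executor.cores": "1",
    }

settings = local_spark_settings()
for key, value in settings.items():
    print(key, "=", value)
```

Each key and value pair can then be passed to `.config(key, value)` when building the SparkSession.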

In [3]:
from pyspark.sql import SparkSession
spark = SparkSession \
        .builder \
        .master("yarn") \
        .appName("tests") \
        .config("spark.executor.instances", "50") \
        .config("spark.executor.memory","5g") \
        .config("spark.driver.memory","10g") \
        .config("spark.executor.cores",'1') \
        .config("spark.scheduler.mode","FIFO") \
        .getOrCreate()

sc = spark.sparkContext

The next step is to load some files to use. In our case, we'll use review.json and business.json. These files will be provided separately, but make sure to put them in HDFS using the hadoop fs -put (local path of file) (target hdfs directory) command. An example of how to put files into an HDFS directory is shown below.

In [17]:
%%bash
# check that your file exists in some local directory. You are free to change this line to any directory
# that you have the file stored in (usually ~/Downloads/review.json, but I'll leave it up to you)
ls ~/Yelp/dataset/review.json ~/Yelp/dataset/business.json

# now create a folder in your hdfs called first_folder
hadoop fs -mkdir /first_folder

# transfer local files into your hdfs. Again adjust the first directory in the command to the actual 
# directory of the review.json and business.json files
hadoop fs -put ~/Yelp/dataset/review.json /first_folder/
hadoop fs -put ~/Yelp/dataset/business.json /first_folder/

# display the contents of the /first_folder/ directory in your hdfs
hadoop fs -ls /first_folder/

/home/hduser1/Yelp/dataset/business.json
/home/hduser1/Yelp/dataset/review.json
Found 2 items
-rw-r--r--   5 hduser1 supergroup  132272455 2018-03-07 16:18 /first_folder/business.json
-rw-r--r--   5 hduser1 supergroup 3819730722 2018-03-07 16:18 /first_folder/review.json


mkdir: `/first_folder': File exists


Now that the dataset is in HDFS, let's load it into Spark.

__Please make sure to change the repartition() argument to 2x the number of CPU cores on your laptop (usually 4 cores). This code was written to run on the servers, which currently have 72 cores.__
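To make the rule of thumb concrete, here is a toy sketch in plain Python. Both function names are hypothetical. The `round_robin` helper only imitates the effect of repartitioning on a small list. Spark's real repartition() is a distributed shuffle, so this is an illustration of why evenly sized partitions keep all cores busy, not of how Spark implements it.

```python
import os

def suggested_partitions(cores=None):
    """Two partitions per core, the rule of thumb given in the note above."""
    cores = cores or os.cpu_count() or 2
    return 2 * cores

def round_robin(rows, num_partitions):
    """Toy stand-in for repartitioning: deal rows out like cards."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions

n = suggested_partitions(cores=4)          # 8 partitions for a 4-core laptop
parts = round_robin(list(range(100)), n)
print(n, [len(p) for p in parts])          # partition sizes differ by at most 1
```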

In [4]:
# load the data from the hdfs directory, and then repartition the data.
# the repartitioning splits the dataset (which should be divided into
# 29 partitions in hdfs) so that it can be processed by all your cores.
reviews = spark.read.json('/first_folder/review.json').repartition(150)
business = spark.read.json('/first_folder/business.json').repartition(150)

# let's print out the schema of the dataset to get an understanding of how the data
# is structured and the data type.
reviews.printSchema()
business.printSchema()

root
 |-- business_id: string (nullable = true)
 |-- cool: long (nullable = true)
 |-- date: string (nullable = true)
 |-- funny: long (nullable = true)
 |-- review_id: string (nullable = true)
 |-- stars: long (nullable = true)
 |-- text: string (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)

root
 |-- address: string (nullable = true)
 |-- attributes: struct (nullable = true)
 |    |-- AcceptsInsurance: boolean (nullable = true)
 |    |-- AgesAllowed: string (nullable = true)
 |    |-- Alcohol: string (nullable = true)
 |    |-- Ambience: struct (nullable = true)
 |    |    |-- casual: boolean (nullable = true)
 |    |    |-- classy: boolean (nullable = true)
 |    |    |-- divey: boolean (nullable = true)
 |    |    |-- hipster: boolean (nullable = true)
 |    |    |-- intimate: boolean (nullable = true)
 |    |    |-- romantic: boolean (nullable = true)
 |    |    |-- touristy: boolean (nullable = true)
 |    |    |-- trendy: boolean (n

So we have a good idea of what both datasets look like. Now let's say we want to do a simple row count, which will tell us how many reviews and businesses there are.

In [5]:
# row count for reviews
print(str(reviews.count()) + " reviews")

# row count for businesses
print(str(business.count()) + " businesses")

4736897 reviews
156639 businesses


SQL operations are straightforward in Spark, including selecting columns (projection), filtering rows by column values (selection), and joins, all of which are shown below. This is because the Spark DataFrame API is built on Spark SQL, and most SQL commands are available as DataFrame methods that carry the same name.

In [6]:
print("original review data's schema")
reviews.printSchema()
# projection on columns for the review file
reviews_text = reviews.select('review_id','text')

print("Schema on review data with projection")
reviews_text.printSchema()

# selection on reviews file
reviews_funny = reviews.filter(reviews.funny > 20)
print(str(reviews_funny.count()) + " reviews that have more than 20 votes")

# Join the reviews and business tables
# on the business_id column
reviews_with_business = reviews.join(business, reviews.business_id == business.business_id,how='inner')

original review data's schema
root
 |-- business_id: string (nullable = true)
 |-- cool: long (nullable = true)
 |-- date: string (nullable = true)
 |-- funny: long (nullable = true)
 |-- review_id: string (nullable = true)
 |-- stars: long (nullable = true)
 |-- text: string (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)

Schema on review data with projection
root
 |-- review_id: string (nullable = true)
 |-- text: string (nullable = true)

4334 reviews that have more than 20 votes


You can also compute summary statistics, find the unique elements of a column, and count how many there are. Please keep in mind that groupBy(), as is the case in most SQL engines, is one of the costliest operations in Spark SQL and should be used only when needed.
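A minimal sketch of these operations is shown here, wrapped in a function so that it can be applied to any DataFrame. The function name `summarize_reviews` is hypothetical, and the DataFrame passed in is assumed to have `stars` and `user_id` columns like the reviews data loaded above. It uses only the standard DataFrame methods `describe`, `select`, `distinct`, `count`, `groupBy` and `orderBy`.

```python
def summarize_reviews(df):
    """Collect a few summary views of a reviews DataFrame.

    df is assumed to be a PySpark DataFrame with 'stars' and 'user_id'
    columns, like the reviews DataFrame loaded earlier.
    """
    return {
        # count, mean, stddev, min and max of the star ratings
        "star_stats": df.describe("stars"),
        # number of distinct reviewers
        "unique_users": df.select("user_id").distinct().count(),
        # number of reviews per star rating (the costly groupBy step)
        "reviews_per_star": df.groupBy("stars").count().orderBy("stars"),
    }
```

With the data from this tutorial, one would call `summary = summarize_reviews(reviews)` and then, for example, `summary["reviews_per_star"].show()` to print the counts.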

Let us suppose that we want to predict the stars given the text data. To make this executable on a normal laptop, let's do this only for the few thousand reviews that have received more than 20 funny votes.

In order to use the review text to predict star ratings, though, I first need to transform the text data into a numeric representation. This is done with a word2vec transformation of the review texts. In other words, I would like to convert each review's text into a vector that represents that review. This takes several steps.

1. Tokenize the text (convert it to a list of lowercase words)
2. Remove all the stop words
3. Feed the cleaned words into a word2vec model, which will then create one vector per review representing the "orientation" of the words in that review

Word2vec uses a 2-layer neural network to learn a vector for each word, with the property that words used in similar contexts end up close together. We train the word2vec model on the text data, and then use a logistic regression model to map the relationship between the resulting vectors and the star ratings.
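The three steps above can be sketched in plain Python on a single sentence. The stop word list and the 2-dimensional word vectors here are made up for illustration. A real word2vec model learns its vectors from the data, and the exact averaging details in Spark may differ from this toy version.

```python
# Tiny, made-up stop word list and word vectors, for illustration only.
STOP_WORDS = {"the", "was", "and", "a"}
WORD_VECTORS = {
    "food": [1.0, 0.0],
    "great": [0.0, 1.0],
    "service": [1.0, 1.0],
}

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def remove_stop_words(words):
    """Drop words that carry little meaning on their own."""
    return [w for w in words if w not in STOP_WORDS]

def review_vector(words):
    """Average the vectors of the known words into one vector per review."""
    known = [WORD_VECTORS[w] for w in words if w in WORD_VECTORS]
    if not known:
        return [0.0, 0.0]
    return [sum(dim) / len(known) for dim in zip(*known)]

words = tokenize("The food was great and the service great")
cleaned = remove_stop_words(words)
print(cleaned)                 # ['food', 'great', 'service', 'great']
print(review_vector(cleaned))  # [0.5, 0.75]
```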

In [7]:
# import relevant packages from Spark
from pyspark.ml.feature import Word2Vec
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import StopWordsRemover

# tokenizer 
tokenizer = Tokenizer(inputCol="text", outputCol="words")
DTMmatrix = tokenizer.transform(reviews_funny)

# Stop word removal
stopremove = StopWordsRemover(inputCol='words',outputCol='cleaned')
dat = stopremove.transform(DTMmatrix)

#fit a word2vec model and transform the data 
word2Vec = Word2Vec(vectorSize=100, minCount=0, numPartitions=150, inputCol="cleaned", outputCol="word2vec")
model = word2Vec.fit(dat)
result = model.transform(dat)

In [8]:
result.printSchema()

root
 |-- business_id: string (nullable = true)
 |-- cool: long (nullable = true)
 |-- date: string (nullable = true)
 |-- funny: long (nullable = true)
 |-- review_id: string (nullable = true)
 |-- stars: long (nullable = true)
 |-- text: string (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- cleaned: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- word2vec: vector (nullable = true)



Now that we have created a word2vec representation of the text, let's see if we can predict the star rating of a funny review from its text. For demonstration purposes, we will use a logistic regression. You should try fitting other models and playing around with the model parameters. See the PySpark documentation for the available models at this [link](https://spark.apache.org/docs/2.2.0/ml-classification-regression.html).

In [9]:
from pyspark.ml.classification import LogisticRegression

# define my logistic regression model. 
# Use default values and give it the label and feature column names
logit = LogisticRegression(featuresCol = 'word2vec', labelCol='stars')

# let's do holdout validation on my model. This means I'll set aside some part of the data as my "test" set
# (or more correctly the validation set) using the randomSplit method.
train, test = result.randomSplit([0.8,0.2])
logit_model = logit.fit(train)

# make predictions using the transform function
# a new column 'prediction' will be added to the test dataframe
test = logit_model.transform(test)
test.printSchema()

root
 |-- business_id: string (nullable = true)
 |-- cool: long (nullable = true)
 |-- date: string (nullable = true)
 |-- funny: long (nullable = true)
 |-- review_id: string (nullable = true)
 |-- stars: long (nullable = true)
 |-- text: string (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- cleaned: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- word2vec: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = true)



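We can also evaluate our models using evaluator objects. The default metric of the MulticlassClassificationEvaluator is the F1 score: the harmonic mean of precision and recall, computed per class and averaged with weights proportional to how often each class occurs. It suits multi-class settings like this one, where the predicted variable is not a yes/no but a star rating ranging from 1 to 5.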
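To make the metric concrete, here is a small plain-Python sketch of a weighted F1 score. The labels and predictions are made up, and the function `weighted_f1` is a hypothetical helper written for illustration, not the evaluator's own code.

```python
from collections import Counter

def weighted_f1(labels, predictions):
    """Per-class F1, averaged with weights equal to each class's share of labels."""
    total = len(labels)
    support = Counter(labels)
    score = 0.0
    for cls, n_true in support.items():
        # true positives: rows where both the label and the prediction are cls
        tp = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        n_pred = sum(1 for p in predictions if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += f1 * n_true / total
    return score

# Made-up star ratings and predictions, for illustration only.
labels      = [5, 5, 5, 1, 1, 3]
predictions = [5, 5, 1, 1, 3, 3]
print(round(weighted_f1(labels, predictions), 4))  # 0.6778
```

A model that predicts every label correctly scores 1.0, and one that gets nothing right scores 0.0.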

In [10]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol='stars')
evaluator.evaluate(test)

0.33435981133169324

We get an F1 score of .33, which is pretty low. Considering that this model uses only the text to predict the star ratings, it is not surprising that its accuracy is low. This tutorial should, however, show you the basics of how Spark scripts can be written. Please write a script that uses another ML algorithm. Note that for this particular script, I did not need to use a VectorAssembler object.

If the inputs to an ML model are spread over multiple columns (say, different attributes of a review), then you need to use the VectorAssembler to create a new column that contains all these column values as one vector for each row. Please look this up and try to figure it out on your own.
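As a hint for that exercise, the idea can be sketched in plain Python without any Spark code. The `assemble` function below is a hypothetical toy and the rows are made up. In PySpark the corresponding class is `VectorAssembler` in `pyspark.ml.feature`, which takes `inputCols` and `outputCol` parameters.

```python
def assemble(rows, input_cols, output_col="features"):
    """Toy version of what a vector assembler does: for every row, gather the
    values of input_cols into one list stored under output_col."""
    assembled = []
    for row in rows:
        new_row = dict(row)  # keep the original columns
        new_row[output_col] = [float(row[c]) for c in input_cols]
        assembled.append(new_row)
    return assembled

# Made-up rows using column names from the reviews schema.
rows = [
    {"stars": 5, "cool": 2, "funny": 21, "useful": 7},
    {"stars": 1, "cool": 0, "funny": 35, "useful": 3},
]
for row in assemble(rows, ["cool", "funny", "useful"]):
    print(row["stars"], row["features"])
```

The new `features` column is what would then be passed to a model as its features column, in place of the `word2vec` column used in this tutorial.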

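For those who want to get ahead, one more thing you can learn is what is known as a _pipeline_ in Spark. A pipeline chains stages such as the tokenizer, stop word remover, word2vec and classifier into a single object that is fit and applied in one call, which makes a workflow much easier to rerun and tune.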