# Lab 4 - Sentiment Analysis for Yelp Reviews

In Lab 2, we looked at how Spark SQL could help us develop useful features for merging and processing our data in HDFS. In Lab 3, we trained a few regression algorithms using Spark ML pipelines and used cross-validation for hyperparameter tuning. The focus of this lab is natural language processing with Spark ML, with the downstream goal of creating sentiment classifiers on the reviews data.

In [None]:
%%configure
{ "name":"SparkSQL_Lab", 
  "executorMemory": "8G", 
  "executorCores": 2, 
  "numExecutors": 20, 
  "driverCores": 2
}

## Reading Data from a Different Storage Container

By default, all storage containers within a storage account have access to each other. In contrast, for ADLS you have to specify the ACLs at the directory level for each cluster. In the chunk below, we create two variables, one holding the container name and one holding the storage account name plus the data path, and concatenate them to build the path to the data directory in our storage container.

In [None]:
val container = "wasb://azmaybach-2017-08-28"
val storageaccount = "@azaidihdi.blob.core.windows.net/yelp/data/"

val full_url = container.concat(storageaccount)
println(full_url)

In [None]:
val biz_path = full_url.concat("yelp_academic_dataset_business.json")
val reviews_path = full_url.concat("yelp_academic_dataset_review.json")

val business = spark.read.json(biz_path)
val reviews = spark.read.json(reviews_path)

In [None]:
println("Number of records in reviews table: " + reviews.count())

In [None]:
println("Number of records in businesses table: " + business.count())

In [None]:
business.printSchema()

In [None]:
business.show(1)

In [None]:
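import org.apache.spark.sql.functions.col

// Keep only the business columns we need for the rest of the lab
val biz_names = Seq("business_id", "name", "city", "stars", "state",
                    "categories", "attributes", "address", "review_count")
val biz = business.select(biz_names.map(c => col(c)): _*)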

In [None]:
biz.show()

In [None]:
val biz_star = business.withColumnRenamed("stars", "ave_stars")

In [None]:
biz_star.createOrReplaceTempView("business")

In [None]:
%%sql
select * from business limit 10

In [None]:
reviews.cache()
biz_star.cache()

## Merge the Datasets

As we described in Lab 2, we will merge the two tables to create the final table we will use for modeling.

In [None]:
// Joining on the column name keeps a single business_id column in the result
val biz_reviews = biz_star.join(reviews, Seq("business_id"), "left_outer")

In [None]:
biz_reviews.explain(true)

In [None]:
biz_reviews.cache()
biz_reviews.createOrReplaceTempView("joinedReviews")

In [None]:
%%sql
select * from joinedReviews limit 10

## Tokenizing Text

Tokenization is the process of converting text data into an array of "tokens", the individual components of the text, usually words.
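
As a small illustration on made-up sentences (not the Yelp data), the sketch below runs Spark ML's `Tokenizer` over a two-row DataFrame. The column names `id`, `text`, and `tokens`, the `toyReviews` table, and the `session` handle are assumptions made for this example; in the notebook, the existing `spark` session plays the same role.

```scala
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one exists (assumes a Spark master is already configured).
val session = SparkSession.builder().getOrCreate()
import session.implicits._

// Hypothetical toy reviews, only to show the transformation.
val toyReviews = Seq(
  (0, "The food was GREAT"),
  (1, "Terrible service, never again")
).toDF("id", "text")

// Tokenizer lowercases the input and splits it on whitespace.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("tokens")

val tokenized = tokenizer.transform(toyReviews)
tokenized.select("text", "tokens").show(false)
```

Because the split is on whitespace only, punctuation stays attached to its word (for example `service,`). `RegexTokenizer` is one alternative when that matters.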

### Exercise 1: Tokenize the reviews data

1. Use the [Tokenizer](https://spark.apache.org/docs/latest/ml-features.html#tokenizer) class to convert the reviews column into an array of tokens
    - [Tokenizer docs](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.Tokenizer)
2. Use the [StopWordsRemover](https://spark.apache.org/docs/latest/ml-features.html#stopwordsremover) to remove stop words from your tokens array
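
To show what the second step does, here is a minimal sketch of `StopWordsRemover` applied to a hand-written, already-tokenized column. The `toyTokens` data and the column names `tokens` and `filtered` are hypothetical; for the exercise, the input would be the output column of your own tokenizer.

```scala
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one exists (assumes a Spark master is already configured).
val session = SparkSession.builder().getOrCreate()
import session.implicits._

// Hypothetical, already-tokenized input: the remover expects an array<string> column.
val toyTokens = Seq(
  (0, Seq("the", "food", "was", "great")),
  (1, Seq("terrible", "service", "never", "again"))
).toDF("id", "tokens")

// With no explicit list set, the remover uses Spark's default English stop words.
val remover = new StopWordsRemover()
  .setInputCol("tokens")
  .setOutputCol("filtered")

val filtered = remover.transform(toyTokens)
filtered.show(false)
```

A custom list can be supplied with `setStopWords` if the defaults remove words that carry sentiment in your data.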

## Exercise 2

Build a Spark ML Pipeline for training a sentiment classifier. You'll use the tokenizer you created earlier and then add pipeline stages for additional feature variables and an estimator for the classifier. Split your data into training and test sets, then calculate your classifier's AUC on the test set.

Useful modules:
* `org.apache.spark.ml.feature.{VectorAssembler, HashingTF, IDF, Tokenizer, Binarizer}`
* `org.apache.spark.mllib.evaluation.BinaryClassificationMetrics`
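
One possible shape for such a pipeline is sketched below on made-up data. Several choices here are assumptions rather than part of the lab: the toy rows, the rule that more than 3 stars counts as positive, the use of logistic regression as the classifier, and the column names. The sketch also swaps in `BinaryClassificationEvaluator` from `org.apache.spark.ml.evaluation` for the AUC, instead of the RDD-based `BinaryClassificationMetrics` listed above, because it works directly on DataFrames.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{Binarizer, HashingTF, IDF, StopWordsRemover, Tokenizer}
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one exists (assumes a Spark master is already configured).
val session = SparkSession.builder().getOrCreate()
import session.implicits._

// Hypothetical toy data: (star rating as Double, review text).
// Rows are repeated only so both splits are non-empty; the resulting AUC is not meaningful.
val baseRows = Seq(
  (5.0, "great food and friendly staff"),
  (4.0, "really tasty and good value"),
  (5.0, "amazing service loved it"),
  (1.0, "terrible food and rude staff"),
  (2.0, "bland overpriced and slow"),
  (1.0, "awful service never again")
)
val toy = Seq.fill(50)(baseRows).flatten.toDF("stars", "text")

// Label: 1.0 when stars > 3.0, otherwise 0.0 (assumed definition of "positive").
val binarizer = new Binarizer()
  .setInputCol("stars").setOutputCol("label").setThreshold(3.0)
val tokenizer = new Tokenizer()
  .setInputCol("text").setOutputCol("tokens")
val remover = new StopWordsRemover()
  .setInputCol("tokens").setOutputCol("filtered")
// Term frequencies via feature hashing, then inverse-document-frequency weighting.
val hashingTF = new HashingTF()
  .setInputCol("filtered").setOutputCol("rawFeatures").setNumFeatures(1 << 12)
val idf = new IDF()
  .setInputCol("rawFeatures").setOutputCol("features")
val lr = new LogisticRegression()
  .setFeaturesCol("features").setLabelCol("label").setMaxIter(20)

val pipeline = new Pipeline().setStages(
  Array[PipelineStage](binarizer, tokenizer, remover, hashingTF, idf, lr))

// Hold out 20% of the rows for evaluation.
val Array(train, test) = toy.randomSplit(Array(0.8, 0.2), seed = 42L)
val model = pipeline.fit(train)
val predictions = model.transform(test)

val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderROC")
val auc = evaluator.evaluate(predictions)
println(s"Test AUC: $auc")
```

On the real joined table, the review `stars` column would likely need a cast to `Double` before `Binarizer`, and rows with a null `text` (possible after the left outer join) would need to be filtered out before tokenizing.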

## Pipeline