## Text Analysis
In this lab, you will create a classification model that performs sentiment analysis of tweets.
### Import Spark SQL and Spark ML Libraries

First, import the libraries you will need:

In [1]:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer, StopWordsRemover}

### Load Source Data
Now load the tweets data into a DataFrame. This data consists of tweets that have been previously captured and classified as positive or negative.

In [2]:
val wkdir = "file:///mnt/c/Users/Adura/Google Drive/Projects/Jupyter/SparkMs/data/"
// Assumption: the tweet data is stored as tweets.csv in wkdir, with SentimentText and Sentiment columns.
val tweets_csv = spark.read.option("inferSchema", "true").option("header", "true").csv(wkdir + "tweets.csv")
tweets_csv.show(truncate = false)

### Prepare the Data
The features for the classification model will be derived from the tweet text (the `SentimentText` column). The label is the sentiment (the `Sentiment` column: 1 for positive, 0 for negative).

In [3]:
// Needed for the $"columnName" syntax when the notebook kernel does not import it automatically.
import spark.implicits._

val data = tweets_csv.select($"SentimentText", $"Sentiment".cast("Int").alias("label"))
data.show(truncate = false)

### Split the Data
As in most classification workflows, you'll split the data into two sets: 70% for training the model and 30% for testing the trained model.

In [4]:
val splits = data.randomSplit(Array(0.7, 0.3))
val train = splits(0)
val test = splits(1).withColumnRenamed("label", "trueLabel")
val train_rows = train.count()
val test_rows = test.count()
println("Training Rows: " + train_rows + " Testing Rows: " + test_rows)

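
`randomSplit` assigns rows to the two sets at random, so the row counts only approximate 70% and 30%, and they change between runs unless a seed is supplied (the method has an overload that takes one). The plain-Scala sketch below illustrates the idea on an ordinary collection. It is a simplified illustration, not Spark's implementation, and `SplitSketch` and its parameters are hypothetical names.

```scala
import scala.util.Random

object SplitSketch {
  // Assign each row independently to the training set with probability
  // trainFraction; everything else goes to the test set.
  def randomSplit[A](rows: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val rng = new Random(seed)
    // Draw one random number per row, in order, so a fixed seed gives a fixed split.
    val tagged = rows.toVector.map(row => (row, rng.nextDouble() < trainFraction))
    val (trainTagged, testTagged) = tagged.partition(_._2)
    (trainTagged.map(_._1), testTagged.map(_._1))
  }
}
```

Calling `SplitSketch.randomSplit(rows, 0.7, 42L)` twice returns the same two sets, which is what makes a seeded split reproducible.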

### Define the Pipeline
The pipeline for the model consists of the following stages:
- A Tokenizer to split the tweets into individual words.
- A StopWordsRemover to remove common words such as "a" or "the" that have little predictive value.
- A HashingTF class to generate numeric vectors from the text values.
- A LogisticRegression algorithm to train a binary classification model.

In [5]:
val tokenizer = new Tokenizer().setInputCol("SentimentText").setOutputCol("SentimentWords")
val swr = new StopWordsRemover().setInputCol(tokenizer.getOutputCol).setOutputCol("MeaningfulWords")
val hashTF = new HashingTF().setInputCol(swr.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features").setMaxIter(10).setRegParam(0.01)
val pipeline = new Pipeline().setStages(Array(tokenizer, swr, hashTF, lr))
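
To make the three feature stages concrete, the sketch below mimics them in plain Scala on a single string. It is a rough illustration under simplifying assumptions: the stop-word list is a tiny hypothetical one, the hash is the JVM's `hashCode` rather than the MurmurHash3 function Spark uses, and the bucket count is far smaller than Spark's default of 2^18. `FeatureSketch` is a hypothetical name.

```scala
object FeatureSketch {
  // Hypothetical, deliberately tiny stop-word list; Spark ships a much longer one.
  val stopWords: Set[String] = Set("a", "an", "the", "is", "to", "of")

  // Lowercase the text and split it on whitespace, roughly what Tokenizer does.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  // Drop words that carry little predictive value.
  def removeStopWords(words: Seq[String]): Seq[String] =
    words.filterNot(stopWords.contains)

  // Hashing trick: map each word to a bucket index and count how many
  // words land in each bucket. Different words may share a bucket.
  def hashingTF(words: Seq[String], numFeatures: Int): Map[Int, Double] =
    words
      .groupBy(word => Math.floorMod(word.hashCode, numFeatures))
      .map { case (index, group) => index -> group.size.toDouble }
}
```

For example, `FeatureSketch.removeStopWords(FeatureSketch.tokenize("The movie is a GREAT surprise"))` keeps only `movie`, `great`, and `surprise`, and `hashingTF` turns those words into a sparse map from bucket index to count.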

### Run the Pipeline as an Estimator
The pipeline itself is an estimator, and so it has a **fit** method that you can call to run the pipeline on a specified DataFrame. In this case, you will run the pipeline on the training data to train a model. 
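
The estimator and transformer roles described above can be illustrated without Spark. The sketch below is a toy and does not reproduce Spark's API: the `Estimator` and `Transformer` traits and the `MeanCenterer` object are hypothetical names chosen for illustration. It shows only the core idea that `fit` learns something from data and returns an object whose `transform` applies what was learned.

```scala
object EstimatorSketch {
  // A transformer maps input rows to output rows.
  trait Transformer[A, B] { def transform(rows: Seq[A]): Seq[B] }

  // An estimator learns from data and returns a fitted transformer (a "model").
  trait Estimator[A, B] { def fit(rows: Seq[A]): Transformer[A, B] }

  // Toy estimator: learns the mean of the training values and produces
  // a model that subtracts that mean from whatever it is given.
  object MeanCenterer extends Estimator[Double, Double] {
    def fit(rows: Seq[Double]): Transformer[Double, Double] = {
      val mean = rows.sum / rows.size
      new Transformer[Double, Double] {
        def transform(in: Seq[Double]): Seq[Double] = in.map(_ - mean)
      }
    }
  }
}
```

Fitting on `Seq(1.0, 2.0, 3.0)` learns a mean of 2.0, and the returned model can then be applied to any other sequence, just as a fitted pipeline model is applied to the test data.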

In [6]:
val pipelineModel = pipeline.fit(train)
println("Pipeline complete!")

### Test the Pipeline Model
The model produced by the pipeline is a transformer that applies every stage in the pipeline to a specified DataFrame and then uses the trained model to generate predictions. In this case, you will transform the **test** DataFrame with the pipeline model to generate label predictions.

In [7]:
val prediction = pipelineModel.transform(test)
val predicted = prediction.select("SentimentText", "features", "prediction", "trueLabel")
predicted.show(100, truncate = false)
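
Keeping the known label as `trueLabel` next to `prediction` makes it possible to check how often the model is right. The plain-Scala sketch below counts true and false positives and negatives from (prediction, trueLabel) pairs and derives accuracy from them. `EvaluationSketch` and `Confusion` are hypothetical names, and the sketch assumes both values have already been converted to doubles, with 1.0 for positive and 0.0 for negative.

```scala
object EvaluationSketch {
  // Counts of true positives, false positives, true negatives, and false negatives.
  final case class Confusion(tp: Int, fp: Int, tn: Int, fn: Int) {
    // Fraction of all predictions that match the true label.
    def accuracy: Double = (tp + tn).toDouble / (tp + fp + tn + fn)
  }

  // Each pair is (prediction, trueLabel), using 1.0 for positive and 0.0 for negative.
  def confusion(pairs: Seq[(Double, Double)]): Confusion =
    Confusion(
      tp = pairs.count { case (p, t) => p == 1.0 && t == 1.0 },
      fp = pairs.count { case (p, t) => p == 1.0 && t == 0.0 },
      tn = pairs.count { case (p, t) => p == 0.0 && t == 0.0 },
      fn = pairs.count { case (p, t) => p == 0.0 && t == 1.0 }
    )
}
```

With the pairs from a real run, `EvaluationSketch.confusion(pairs).accuracy` gives the share of test tweets whose predicted sentiment matches the known one.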