## Text Analysis
In this lab, you will create a classification model that performs sentiment analysis of tweets.
### Import Spark SQL and Spark ML Libraries

First, create a Spark session and import the libraries you will need:

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.\
        builder.\
        appName("pyspark-notebook").\
        master("spark://spark-master:7077").\
        config("spark.executor.memory", "4098m").\
        getOrCreate()

In [2]:
!pip install numpy



In [3]:
from pyspark.sql.types import *
from pyspark.sql.functions import *

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer, StopWordsRemover

### Load Source Data
Now load the tweet data into a DataFrame. The data consists of previously captured tweets, each already classified as positive or negative.

In [4]:
tweets_csv = spark.read.csv('../data/tweets.csv', inferSchema=True, header=True)
tweets_csv.show(truncate=False)

+------+---------+---------------+-----------------------------------------------------------+
|ItemID|Sentiment|SentimentSource|SentimentText                                              |
+------+---------+---------------+-----------------------------------------------------------+
|1038  |1        |Sentiment140   |that film is fantastic #brilliant                          |
|1804  |1        |Sentiment140   |this music is really bad #myband                           |
|1693  |0        |Sentiment140   |winter is terrible #thumbs-down                            |
|1477  |0        |Sentiment140   |this game is awful #nightmare                              |
|45    |1        |Sentiment140   |I love jam #loveit                                         |
|246   |0        |Sentiment140   |I dislike skiing #rubbish                                  |
|776   |1        |Sentiment140   |I like pop music #toptastic                                |
|1666  |1        |Sentiment140   |this game is awf

### Prepare the Data
The features for the classification model will be derived from the tweet text. The label is the sentiment (1 for positive, 0 for negative).

In [5]:
data = tweets_csv.select("SentimentText", col("Sentiment").cast("Int").alias("label"))
data.show(truncate=False)

+-----------------------------------------------------------+-----+
|SentimentText                                              |label|
+-----------------------------------------------------------+-----+
|that film is fantastic #brilliant                          |1    |
|this music is really bad #myband                           |1    |
|winter is terrible #thumbs-down                            |0    |
|this game is awful #nightmare                              |0    |
|I love jam #loveit                                         |1    |
|I dislike skiing #rubbish                                  |0    |
|I like pop music #toptastic                                |1    |
|this game is awful good                                    |1    |
|rock music is terrible #worstever                          |0    |
|that movie is great #favorite                              |1    |
|I hate this game #fail                                     |0    |
|I dislike this game #thumbs-down               

### Split the Data
As with most classification modeling processes, you'll split the data into one set for training and another for testing the trained model.

In [6]:
splits = data.randomSplit([0.7, 0.3])
train = splits[0]
test = splits[1].withColumnRenamed("label", "trueLabel")
train_rows = train.count()
test_rows = test.count()
print("Training Rows:", train_rows, " Testing Rows:", test_rows)

Training Rows: 1315  Testing Rows: 617
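
The counts above (1315 and 617 of 1932 rows) are close to 70/30 but not exact. The sketch below shows one way this can happen, assuming each row is assigned to a split independently at random. It is a pure-Python illustration of the idea, not Spark's actual implementation, and `bernoulli_split` is a hypothetical helper name. The counts it prints will differ from the notebook output.

```python
import random

def bernoulli_split(rows, train_fraction, seed=None):
    """Assign each row to train or test independently at random."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in train with probability train_fraction,
        # so the final sizes are only approximately 70/30.
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1932))  # stand-in for the 1932 labelled tweets
sample_train, sample_test = bernoulli_split(rows, 0.7, seed=42)
print("Training Rows:", len(sample_train), " Testing Rows:", len(sample_test))
```

In the notebook itself, `randomSplit` accepts an optional `seed` argument, for example `data.randomSplit([0.7, 0.3], seed=42)`, which makes the split easier to reproduce between runs.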


### Define the Pipeline
The pipeline for the model consists of the following stages:
- A Tokenizer to split the tweets into individual words.
- A StopWordsRemover to remove common words such as "a" or "the" that have little predictive value.
- A HashingTF transformer to generate numeric feature vectors from the remaining words.
- A LogisticRegression algorithm to train a binary classification model.
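
To make the first three stages concrete, here is a minimal pure-Python sketch of what they do to a single tweet. It is an approximation for illustration only: the stop-word list is a tiny made-up subset, the bucket count of 16 is chosen for readability (Spark's `HashingTF` defaults to 262144 features), and CRC32 stands in for the MurmurHash3 function that Spark uses. The function names are hypothetical.

```python
import zlib
from collections import Counter

STOP_WORDS = {"a", "the", "is", "this", "that", "i"}  # tiny illustrative list

def tokenize(text):
    # Roughly what Tokenizer does: lowercase, then split on whitespace.
    return text.lower().split()

def remove_stop_words(words):
    # Drop common words that carry little predictive value.
    return [w for w in words if w not in STOP_WORDS]

def hashing_tf(words, num_features=16):
    # Hash each word to a bucket index and count occurrences per bucket.
    # Different words can collide in the same bucket; their counts add up.
    counts = Counter(zlib.crc32(w.encode("utf-8")) % num_features for w in words)
    return [counts.get(i, 0) for i in range(num_features)]

words = tokenize("That film is fantastic #brilliant")
meaningful = remove_stop_words(words)
features = hashing_tf(meaningful)
print(words)
print(meaningful)
print(features)
```

The resulting vector has a fixed length regardless of vocabulary size, which is why no dictionary of known words has to be built before training.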

In [7]:
tokenizer = Tokenizer(inputCol="SentimentText", outputCol="SentimentWords")
swr = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="MeaningfulWords")
hashTF = HashingTF(inputCol=swr.getOutputCol(), outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, swr, hashTF, lr])

### Run the Pipeline as an Estimator
The pipeline itself is an estimator, and so it has a **fit** method that you can call to run the pipeline on a specified DataFrame. In this case, you will run the pipeline on the training data to train a model.

In [8]:
piplineModel = pipeline.fit(train)
print("Pipeline complete!")

Pipeline complete!
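
The estimator and transformer roles can be sketched with a toy example in plain Python. This is not how Spark ML is implemented; the class names are hypothetical and the "model" simply predicts the most common training label. It only illustrates the contract: `fit` learns from data and returns a model, and the model's `transform` adds predictions to new rows.

```python
from collections import Counter

class MajorityLabelEstimator:
    """Toy estimator: learns the most common label in the training rows."""

    def fit(self, rows):
        # rows is a list of (text, label) pairs; fitting learns state from them.
        most_common_label = Counter(label for _, label in rows).most_common(1)[0][0]
        return MajorityLabelModel(most_common_label)

class MajorityLabelModel:
    """Toy fitted model (a transformer): appends a prediction to each row."""

    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        # Transforming does not change the learned state; it only adds a column.
        return [row + (self.label,) for row in rows]

toy_train = [("I love jam", 1), ("I like pop music", 1), ("winter is terrible", 0)]
toy_model = MajorityLabelEstimator().fit(toy_train)
print(toy_model.transform([("I adore cheese", 1)]))
```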


### Test the Pipeline Model
The model produced by the pipeline is a transformer that will apply all of the stages in the pipeline to a specified DataFrame and apply the trained model to generate predictions. In this case, you will transform the **test** DataFrame using the pipeline to generate label predictions.

In [9]:
prediction = piplineModel.transform(test)
predicted = prediction.select("SentimentText", "prediction", "trueLabel")
predicted.show(100, truncate=False)

+--------------------------------------+----------+---------+
|SentimentText                         |prediction|trueLabel|
+--------------------------------------+----------+---------+
|I adore cheese #favorite              |1.0       |1        |
|I adore cheese #thumbs-up             |1.0       |1        |
|I adore classical music #favorite     |1.0       |1        |
|I adore coffee #bestever              |1.0       |1        |
|I adore coffee #brilliant             |1.0       |1        |
|I adore coffee #favorite              |1.0       |1        |
|I adore coffee #toptastic             |1.0       |1        |
|I adore jam #brilliant                |1.0       |1        |
|I adore jam #favorite                 |1.0       |1        |
|I adore jam #loveit                   |1.0       |1        |
|I adore pop music #thumbs-up          |1.0       |1        |
|I adore rock music #loveit            |1.0       |1        |
|I adore rock music #toptastic         |1.0       |1        |
|I adore
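
Scanning the table by eye only goes so far, so one possible next step is to count how often `prediction` agrees with `trueLabel`. The sketch below does this in plain Python for a list of `(prediction, trueLabel)` pairs. The helper names are hypothetical, and the sample pairs are made up for illustration; they are not results from this lab.

```python
from collections import Counter

def confusion_counts(pairs):
    """Count true/false positives and negatives from (prediction, trueLabel) pairs."""
    counts = Counter()
    for prediction, true_label in pairs:
        if prediction == 1 and true_label == 1:
            counts["tp"] += 1
        elif prediction == 1 and true_label == 0:
            counts["fp"] += 1
        elif prediction == 0 and true_label == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts

def accuracy(counts):
    # Fraction of rows where the prediction matched the true label.
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0

# Made-up pairs for illustration only; these are not results from the lab.
sample_pairs = [(1.0, 1), (1.0, 1), (0.0, 0), (1.0, 0), (0.0, 1)]
sample_counts = confusion_counts(sample_pairs)
print(dict(sample_counts), accuracy(sample_counts))
```

For a test set small enough to bring to the driver, the pairs could come from `predicted.select("prediction", "trueLabel").collect()`, since each returned `Row` unpacks like a tuple.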