# Text Processing
All tasks here are word-counting exercises. The main challenge is pre-processing the raw data into a clean, normalized form.

1. Spark Session:<br>
A SparkSession also exposes its underlying context as "sparkSession.sparkContext". In the Spark shell and in notebook kernels, the default SparkSession is predefined as "spark" and its SparkContext as "sc".<br><br>
2. Read data:<br>
txt file:<br>
```
val textRDD = sparkSession.sparkContext.textFile("files/TaleOfTwoCities.txt")
val textDS = sparkSession.read.textFile("C:/Users/Matthew/Desktop/TaleOfTwoCities.txt")
val textDF = sparkSession.read.text("C:/Users/Matthew/Desktop/TaleOfTwoCities.txt")
val raw1RDD = sparkSession.sparkContext.textFile("files/docword.enron.txt.gz") // only one partition
```
csv file:<br>
```
val raw1RDD = sparkSession.sparkContext.textFile("files/CensusIncomeData.csv", 1)
val raw1DF = sparkSession.read.
    format("csv").
    option("header", "false").
    load("files/Data.csv") // only one partition
```
3. Access the first element of a container and check its length:<br>
List or Seq: a(0), a.length<br>
Array: a(0), a.length<br>
Tuple: a.\_1 (a tuple has no "length"; use a.productArity)<br>
Row: a.get(0) or a(0) (type Any)<br>
    a.getDouble(0) (type Double)<br>
    a(0).asInstanceOf\[Double\] (type Double)<br>
    a.length<br><br>
4. Check the number of partitions:<br>
```
dataRDD.partitions.size
dataDF.rdd.partitions.size
```
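
The access patterns in items 3 and 4 can be tried on small hand-made values. The following is a minimal sketch, assuming Spark is on the classpath (for example in the Spark shell); the names `demoSession`, `demoRDD`, and the toy values are hypothetical and not part of the notebook.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Item 3: first element and length of each container type.
val xs  = Seq(1.5, 2.5, 3.5)
val arr = Array(1.5, 2.5, 3.5)
val tup = (1.5, "a")
val row = Row(1.5, "a") // hypothetical Row built by hand for illustration

val fromSeq: Double = xs(0)
val fromArr: Double = arr(0)
val fromTup: Double = tup._1
val asAny: Any      = row(0)                      // same as row.get(0), typed as Any
val typed: Double   = row.getDouble(0)            // typed getter
val viaCast: Double = row(0).asInstanceOf[Double] // explicit cast from Any

// A tuple has no length method; productArity gives its number of elements.
val lengths = Seq(xs.length, arr.length, tup.productArity, row.length)

// Item 4: number of partitions before and after repartitioning.
val demoSession = SparkSession.builder.
    master("local[2]").
    appName("AccessSketch").
    getOrCreate()
val demoRDD   = demoSession.sparkContext.parallelize(1 to 100, 4)
val numBefore = demoRDD.partitions.size
val numAfter  = demoRDD.repartition(2).partitions.size
```

The typed getter and the cast both fail at runtime if the element is not actually a Double, so they are only safe when the column type is known.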

In [1]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{Tokenizer, StopWordsRemover}
import org.apache.spark.sql.Row
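// getOrCreate returns the session that already exists in this kernel, if there is one.
// In that case the master("local[4]") setting is not applied (the log below reports master = local[*]).
val sparkSession = SparkSession.builder.
    master("local[4]").
    appName("Text Processing").
    getOrCreate()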

Intitializing Scala interpreter ...

Spark Web UI available at http://143.167.112.136:4042
SparkContext available as 'sc' (version = 2.2.0, master = local[*], app id = local-1522079562541)
SparkSession available as 'spark'


import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{Tokenizer, StopWordsRemover}
import org.apache.spark.sql.Row
sparkSession: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@6da3598d


## Task 1: pipeline
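1. Read the txt file into an RDD.
2. Use a "Pipeline" to convert the words to lowercase ("Tokenizer") and remove the stop words ("StopWordsRemover").
3. Use "explode" to flatten the token arrays. Rows whose array is empty produce no output row, so the removed stop words disappear from the DataFrame.
4. Use "withColumn" to append a column, and "agg" together with "groupBy" to append an aggregated column.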

In [2]:
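// Note: toDF, $"..." and explode/count/desc are in scope here because the kernel pre-imports
// spark.implicits._ and org.apache.spark.sql.functions._; a standalone program must import them explicitly.
// Read data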
val rawRDD = sparkSession.sparkContext.textFile("files/AnimalFarmChap1.txt", 20)
val wordsDF = rawRDD.flatMap(_.split("\\W").filter(_!="")).toDF("words")

// Build pipeline
val tokenizer = new Tokenizer().
    setInputCol("words").
    setOutputCol("smallWords")

val remover = new StopWordsRemover().
    setInputCol("smallWords").
    setOutputCol("smallNoStopWords")

val pipeline = new Pipeline().
    setStages(Array(tokenizer, remover))

val model = pipeline.fit(wordsDF)
val pipedDF = model.transform(wordsDF)

// "explode" flattens an array column into one row per element. Here it also drops the rows
// whose array is empty (the words that were removed as stop words).
// "withColumn" appends a column; "agg" also appends a column, but it is used together with "groupBy".
val dataDF = pipedDF.withColumn("explodedWords", explode($"smallNoStopWords"))
// groupBy(...).count would add a column named "count" automatically; "agg" lets us choose the name.
val mostFreqDF = dataDF.groupBy("explodedWords").agg(count("*") as "TopFreqWords").orderBy(desc("TopFreqWords"))

// Count the words that have at least 6 characters
val words6DF = rawRDD.flatMap(_.split("\\W").filter(_!="").filter(_.length >= 6)).toDF("words")
val piped6DF = model.transform(words6DF) // no need to refit the model: the pipeline contains only transformers
val data6DF = piped6DF.withColumn("explodedWords", explode($"smallNoStopWords"))
val mostFreq6DF = data6DF.groupBy("explodedWords").agg(count("*") as "TopFreqWords").orderBy(desc("TopFreqWords")) // descending, so the most frequent words come first

rawRDD: org.apache.spark.rdd.RDD[String] = files/AnimalFarmChap1.txt MapPartitionsRDD[1] at textFile at <console>:31
wordsDF: org.apache.spark.sql.DataFrame = [words: string]
tokenizer: org.apache.spark.ml.feature.Tokenizer = tok_ed1e9b186e09
remover: org.apache.spark.ml.feature.StopWordsRemover = stopWords_d37722f00621
pipeline: org.apache.spark.ml.Pipeline = pipeline_f022bf73670a
model: org.apache.spark.ml.PipelineModel = pipeline_f022bf73670a
pipedDF: org.apache.spark.sql.DataFrame = [words: string, smallWords: array<string> ... 1 more field]
dataDF: org.apache.spark.sql.DataFrame = [words: string, smallWords: array<string> ... 2 more fields]
mostFreqDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [explodedWords: string, TopFreqWords: bigint]
words6DF: org.apache.spark.s...
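
To see why "explode" removes the stop-word rows, it helps to run the same two transformers on a few hand-picked words. This is a minimal sketch under stated assumptions: the toy words, `demoSession`, and the `toy*` names are hypothetical, and the transformers are applied directly instead of through a "Pipeline".

```scala
import org.apache.spark.ml.feature.{StopWordsRemover, Tokenizer}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val demoSession = SparkSession.builder.
    master("local[1]").
    appName("ExplodeSketch").
    getOrCreate()
import demoSession.implicits._

// Four words, two of which ("the", "of") are in the default English stop-word list.
val toyDF = Seq("The", "Farm", "of", "Animals").toDF("words")

val toyTokenizer = new Tokenizer().
    setInputCol("words").
    setOutputCol("smallWords")
val toyRemover = new StopWordsRemover().
    setInputCol("smallWords").
    setOutputCol("smallNoStopWords")

// Tokenizer lowercases each word; StopWordsRemover leaves an empty array for stop words.
val cleanedDF = toyRemover.transform(toyTokenizer.transform(toyDF))

// explode emits one row per array element, so empty arrays contribute no rows.
val explodedDF = cleanedDF.withColumn("explodedWords", explode(col("smallNoStopWords")))

val rowsBefore = cleanedDF.count()  // 4 rows, two of them with an empty array
val rowsAfter  = explodedDF.count() // 2 rows
val keptWords  = explodedDF.select("explodedWords").as[String].collect().sorted
```

Only "farm" and "animals" survive, already lowercased, which is the shape of data the "groupBy" step in the notebook counts.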

## Task 2:
1. Read the csv file into an RDD with a single partition.
2. Remove the last, incomplete line.

In [3]:
val raw1RDD = sparkSession.sparkContext.textFile("files/CensusIncomeData.csv", 1)
val fileEnd = raw1RDD.count.toInt // "count" returns a Long; "take" below needs an Int
val rawRDD = raw1RDD.mapPartitions(_.take(fileEnd-1)).repartition(20) // drop the last (incomplete) line; this only works because the file was read into a single partition
val dataRDD = rawRDD.map(_.split(",").map(_.trim))
val dataDF = dataRDD.map(_(3)).toDF("degree")
dataDF.groupBy("degree").count.show(1)

+-------+-----+
| degree|count|
+-------+-----+
|Masters| 1341|
+-------+-----+
only showing top 1 row



raw1RDD: org.apache.spark.rdd.RDD[String] = files/CensusIncomeData.csv MapPartitionsRDD[5] at textFile at <console>:39
fileEnd: Int = 25776
rawRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[10] at repartition at <console>:41
dataRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[11] at map at <console>:42
dataDF: org.apache.spark.sql.DataFrame = [degree: string]
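
The "mapPartitions(_.take(...))" trick above depends on the RDD having exactly one partition. As a possible alternative (not the notebook's method), the last line can be dropped by index with "zipWithIndex", which works for any number of partitions. A minimal sketch, with hypothetical toy lines and names:

```scala
import org.apache.spark.sql.SparkSession

val demoSession = SparkSession.builder.
    master("local[2]").
    appName("DropLastSketch").
    getOrCreate()

// Hypothetical toy lines spread over 2 partitions; the last one is incomplete.
val toyLines = demoSession.sparkContext.
    parallelize(Seq("39, Bachelors", "50, Masters", "38"), 2)

val total = toyLines.count() // Long, no conversion to Int needed here

// zipWithIndex pairs every line with its global position in the RDD.
val completeLines = toyLines.
    zipWithIndex().
    filter { case (_, idx) => idx < total - 1 }.
    map { case (line, _) => line }

val result = completeLines.collect()
```

The trade-off is an extra Spark job, since "zipWithIndex" has to count the elements of the earlier partitions when there is more than one.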


## Task 3:
1. Read the csv file directly into a DataFrame, which has only 1 partition.
2. Get an element out of a Row.

In [4]:
// Read the data directly into a DataFrame. This avoids the "ArrayIndexOutOfBoundsException"
// that splitting the lines by hand raises on the last row, which does not contain "degree".
// The CSV reader fills the missing "degree" of the last line with "null" instead.
val raw1DF = sparkSession.read.format("csv").
    option("header", "false").
    load("files/CensusIncomeData.csv")
val dataDF = raw1DF.select("_c3").repartition(20)
val sudDataDF = dataDF.filter(x => (x.get(0)==" Masters") || (x.get(0)==" 10th")) // x is a Row; use "get" to obtain an element of the Row. The values keep the leading space from the file.
val countDF = sudDataDF.groupBy("_c3").count
countDF.show

+--------+-----+
|     _c3|count|
+--------+-----+
|    10th|  753|
| Masters| 1341|
+--------+-----+



raw1DF: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 13 more fields]
dataDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_c3: string]
sudDataDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_c3: string]
countDF: org.apache.spark.sql.DataFrame = [_c3: string, count: bigint]
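
The same selection can also be written with column expressions instead of a lambda over Row, which avoids hard-coding the leading space. This is a sketch of an alternative, not the notebook's approach; the toy values and the names `demoSession`, `toyDF`, and `subDF` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, trim}

val demoSession = SparkSession.builder.
    master("local[1]").
    appName("RowFilterSketch").
    getOrCreate()
import demoSession.implicits._

// Toy column that mimics "_c3": every value carries a leading space.
val toyDF = Seq(" Masters", " 10th", " Bachelors", " Masters").toDF("_c3")

// trim removes the surrounding spaces before the comparison; isin tests membership.
val subDF = toyDF.filter(trim(col("_c3")).isin("Masters", "10th"))

// Collect the counts into a Map for easy inspection (keys are the untrimmed values).
val counts = subDF.groupBy("_c3").count().
    collect().
    map(r => (r.getString(0), r.getLong(1))).
    toMap
```

Note the typed getters on the collected rows: "getString" and "getLong" match the string and bigint columns that "groupBy(...).count" produces.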


## Task 4:
1. Read the gz file. A gzip file cannot be split, so it is read into one partition.
2. Convert each RDD element into a tuple; the number of elements in the tuple becomes the number of DataFrame columns.

In [5]:
val raw1RDD = sparkSession.sparkContext.textFile("files/docword.enron.txt.gz", 1)
val rawRDD = raw1RDD.mapPartitions(_.drop(3)).repartition(20) // skip the first 3 lines of the file; safe because there is only one partition
val dataRDD = rawRDD.map(_.split("\\W")).map(x => (x(1), x(2))) // a two-column DataFrame needs a 2-tuple, whose elements are accessed by x._1 and x._2
val dataDF = dataRDD.toDF("wordID", "count")
dataDF.groupBy("wordID").count.show

+------+-----+
|wordID|count|
+------+-----+
| 18130|  795|
| 14204|  802|
| 26005|   71|
| 23843| 1505|
| 17686| 2619|
| 23097|   52|
| 12394|  138|
| 11888|  422|
| 24269|  494|
| 14369|   23|
| 20569|  607|
| 21452|   83|
| 14157|  229|
| 25555|  645|
| 14887|  877|
| 23459|   77|
|  1159|  430|
| 25032|  250|
|   467|  174|
| 18726|  138|
+------+-----+
only showing top 20 rows



raw1RDD: org.apache.spark.rdd.RDD[String] = files/docword.enron.txt.gz MapPartitionsRDD[37] at textFile at <console>:38
rawRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[42] at repartition at <console>:39
dataRDD: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[44] at map at <console>:40
dataDF: org.apache.spark.sql.DataFrame = [wordID: string, count: string]
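
In the cell above both columns are strings, and "groupBy("wordID").count" counts how many rows mention each word ID. If the goal were the total number of occurrences per word, one option is to convert the fields to Int inside the tuple and sum the "count" column. A minimal sketch, assuming lines of the form "docID wordID count"; the toy lines and names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val demoSession = SparkSession.builder.
    master("local[1]").
    appName("TupleSketch").
    getOrCreate()
import demoSession.implicits._

// Hypothetical lines in "docID wordID count" form.
val toyLines = Seq("1 118 1", "1 285 2", "2 118 3")

// A 2-tuple of Int gives a DataFrame with two integer columns.
val typedDF = toyLines.
    map(_.split("\\W")).
    map(x => (x(1).toInt, x(2).toInt)).
    toDF("wordID", "count")

// Sum the per-document counts for each word ID; the sum of an int column is a bigint (Long).
val totals = typedDF.
    groupBy("wordID").
    agg(sum("count") as "total").
    orderBy("wordID").
    collect().
    map(r => (r.getInt(0), r.getLong(1)))
```

Word 118 appears in two of the toy lines with counts 1 and 3, so its total is 4, while a plain row count would report 2.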
