<h1>Machine Learning Library (MLlib)</h1>

[MLlib](http://spark.apache.org/docs/latest/ml-guide.html) is Spark’s machine learning (ML) library. It provides:

- *ML Algorithms*: common learning algorithms such as classification, regression, clustering, and collaborative filtering
- *Featurization*: feature extraction, transformation, dimensionality reduction, and selection
- *Pipelines*: tools for constructing, evaluating, and tuning ML Pipelines
- *Persistence*: saving and loading algorithms, models, and Pipelines
- *Utilities*: linear algebra, statistics, data handling, etc.

We start with the usual setup (version settings, classpath and imports), this time including <tt>MLlib</tt>.

In [1]:
val sparkVersion = "2.0.1"
val scalaVersion = scala.util.Properties.versionNumberString

sparkVersion: String = "2.0.1"
scalaVersion: String = "2.11.8"

In [2]:
classpath.add(
    "org.apache.spark" %% "spark-yarn" % sparkVersion,
    "org.apache.spark" %% "spark-mllib" % sparkVersion
)

147 new artifact(s)


147 new artifacts in macro
147 new artifacts in runtime
147 new artifacts in compile




In [3]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.util.MLUtils

// imports for the text document pipeline
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{Tokenizer, StopWordsRemover}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{Tokenizer, StopWordsRemover}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

In [4]:
// Create Spark session
val sparkSession = SparkSession.builder
    .master("local[1]")
    .appName("Spark dataframes and datasets")
    .getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/03/02 09:48:23 INFO SparkContext: Running Spark version 2.0.1
17/03/02 09:48:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/02 09:48:23 INFO SecurityManager: Changing view acls to: b97eec96efcb40779e247b002e047f82
17/03/02 09:48:23 INFO SecurityManager: Changing modify acls to: b97eec96efcb40779e247b002e047f82
17/03/02 09:48:23 INFO SecurityManager: Changing view acls groups to: 
17/03/02 09:48:23 INFO SecurityManager: Changing modify acls groups to: 
17/03/02 09:48:23 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(b97eec96efcb40779e247b002e047f82); groups with view permissions: Set(); users  with modify permissions: Set(b97eec96efcb40779e247b002e047f82); groups with modify permissions: Set()
17/03/02 09:48:24 INFO Utils: Successfully started service 

sparkSession: SparkSession = org.apache.spark.sql.SparkSession@5d265c75

<tt>MLlib</tt> allows easy combination of numerous algorithms into a single pipeline using standardized APIs for machine learning algorithms. The key concepts are:

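- **DataFrame**. The ML API uses the DataFrame from Spark SQL as its dataset; a DataFrame can hold a variety of data types, such as text, feature vectors, labels and predictions.
- **Transformer**. An algorithm which transforms one DataFrame into another, for example by appending a column.
- **Estimator**. An algorithm which can be fit on a DataFrame to produce a Transformer.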
- **Pipeline**. A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
- **Parameter**. Transformers and Estimators share a common API for specifying parameters.

More details on these follow below. A list of some of the available ML features is [here](http://spark.apache.org/docs/latest/ml-features.html).
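As a minimal sketch of the Transformer, Estimator and Parameter concepts listed above, the snippet below configures one of each and sets parameters in two ways: through setter methods and through a <tt>ParamMap</tt>. The names <tt>demoTokenizer</tt>, <tt>demoLr</tt> and <tt>overrides</tt> are illustrative only, and the parameter values are arbitrary.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.ml.param.ParamMap

// A Transformer: configured entirely through its parameters
val demoTokenizer = new Tokenizer()
    .setInputCol("text")
    .setOutputCol("words")

// An Estimator: parameters can be set through setter methods ...
val demoLr = new LogisticRegression()
    .setMaxIter(10)
    .setRegParam(0.01)

// ... or collected in a ParamMap, which can be handed to fit() later
// to override the values set above (the values here are arbitrary)
val overrides = ParamMap(demoLr.maxIter -> 20, demoLr.regParam -> 0.1)

println(demoTokenizer.getOutputCol)
println(demoLr.getMaxIter)
println(overrides.get(demoLr.maxIter))

// explainParams() lists every parameter with its documentation and current value
println(demoLr.explainParams())
```

Because both kinds of stage share the same parameter API, the same setter and <tt>ParamMap</tt> style applies to every stage of a pipeline.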

<h2>Datasets and Dataframes</h2>

With the introduction of <tt>SparkSession</tt> in Spark 2.0, the [Dataset](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset) took over from the resilient distributed dataset (RDD) as the main data abstraction, although RDDs remain available. A DataFrame is a Dataset of rows organised into named columns. Like RDDs, Datasets are objects which can be worked on in parallel. The available operations are:

- **transformations**: produce new datasets
- **actions**: computations which return results
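The difference between the two kinds of operation can be sketched as follows: transformations only describe a new dataset, while actions trigger the computation and return values. This is a minimal illustration on a small range of numbers; the session name <tt>spark</tt> and the other value names are assumptions made for this example.

```scala
import org.apache.spark.sql.SparkSession

// `spark` is a hypothetical name; getOrCreate() reuses a session that is already running
val spark = SparkSession.builder
    .master("local[1]")
    .appName("transformations and actions sketch")
    .getOrCreate()
import spark.implicits._

// The numbers 1 to 10 (the end of the range is exclusive)
val numbers = spark.range(1, 11).as[Long]

// Transformations: each returns a new Dataset, nothing is computed yet
val evens = numbers.filter(n => n % 2 == 0)
val doubled = evens.map(n => n * 2)

// Actions: these trigger the computation and return results to the driver
val howMany = doubled.count()
val values = doubled.collect()
val sum = doubled.reduce((a, b) => a + b)

println(s"count=$howMany, values=${values.mkString(",")}, sum=$sum")
```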

We start by creating DataFrames and Datasets and printing their contents. The cell below creates a DataFrame from a JSON file, then prints its contents and schema, followed by a few selections, a filter and an aggregate:

In [5]:
// create a dataframe based on the contents of a JSON file
val peopleDF = sparkSession.read.json("files/people.json")

peopleDF.show()

// Print the schema in a tree format
peopleDF.printSchema()

// Select only the "name" column
peopleDF.select("name").show()

// This import is needed to use the $-notation
import sparkSession.implicits._

// Select everybody, but increment the age by 1
peopleDF.select($"name", $"age" + 1).show()

// Select people older than 21
peopleDF.filter($"age" > 21).show()

// Count people by age
peopleDF.groupBy("age").count().show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+

+-------+---------+
|   name|(age + 1)|
+-------+---------+
|Michael|     null|
|   Andy|       31|
| Justin|       20|
+-------+---------+

+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+

+----+-----+
| age|count|
+----+-----+
|  19|    1|
|null|    1|
|  30|    1|
+----+-----+



peopleDF: org.apache.spark.sql.package.DataFrame = [age: bigint, name: string]
import sparkSession.implicits._

A Dataset example is in the cell below:

In [6]:
// create a dataset using sparkSession.range, from 5 (inclusive) to 100 (exclusive) in increments of 5
val numDS = sparkSession.range(5, 100, 5)

// order by column
numDS.orderBy("id").show(5)

import sparkSession.implicits._

numDS.orderBy($"id".desc).show(5)

// compute descriptive stats and display them
numDS.describe().show()

+---+
| id|
+---+
|  5|
| 10|
| 15|
| 20|
| 25|
+---+
only showing top 5 rows

+---+
| id|
+---+
| 95|
| 90|
| 85|
| 80|
| 75|
+---+
only showing top 5 rows

+-------+------------------+
|summary|                id|
+-------+------------------+
|  count|                19|
|   mean|              50.0|
| stddev|28.136571693556885|
|    min|                 5|
|    max|                95|
+-------+------------------+



numDS: org.apache.spark.sql.Dataset[java.lang.Long] = [id: bigint]
import sparkSession.implicits._

Another dataframe example, showing access to columns:

In [10]:
// create a DataFrame using sparkSession.createDataFrame from a List or Seq
val langPercentDF = sparkSession.createDataFrame(List(("Scala", 35), ("Python", 30), ("R", 15), ("Java", 20)))

// rename the columns
val lpDF = langPercentDF.withColumnRenamed("_1", "language").withColumnRenamed("_2", "percent")

// order the DataFrame in descending order of percentage
lpDF.orderBy($"percent".desc).show(false)

+--------+-------+
|language|percent|
+--------+-------+
|Scala   |35     |
|Python  |30     |
|Java    |20     |
|R       |15     |
+--------+-------+



langPercentDF: org.apache.spark.sql.package.DataFrame = [_1: string, _2: int]
lpDF: org.apache.spark.sql.package.DataFrame = [language: string, percent: int]
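As a variation on the example above, the columns can be named when the DataFrame is created, and a column can be referred to in several equivalent ways. This is a sketch only; <tt>spark</tt>, <tt>langDF</tt> and the other value names are assumptions made for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// `spark` is a hypothetical name; getOrCreate() reuses a session that is already running
val spark = SparkSession.builder
    .master("local[1]")
    .appName("column access sketch")
    .getOrCreate()
import spark.implicits._

// toDF on a local Seq names the columns directly, so no renaming is needed
val langDF = Seq(("Scala", 35), ("Python", 30), ("R", 15), ("Java", 20))
    .toDF("language", "percent")

// Three equivalent ways of referring to the same column
val byDollar = langDF.select($"language")
val byCol = langDF.select(col("language"))
val byApply = langDF.select(langDF("language"))

// Column expressions can be combined, here to keep languages above 20 percent
val popular = langDF.filter($"percent" > 20).orderBy(col("percent").desc)
popular.show(false)
```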

<h3>Reading text</h3>

Aside from creating a dataset by transforming a previous one, we can also read data from a file directly into a dataset:

In [11]:
// Read a csv file
val dfCrime = sparkSession.read.option("header","true").csv("files/SacramentocrimeJanuary2006.csv")
dfCrime.show()

+-----------+--------------------+--------+----------+----+--------------------+-------------+-----------+------------+
|  cdatetime|             address|district|      beat|grid|          crimedescr|ucr_ncic_code|   latitude|   longitude|
+-----------+--------------------+--------+----------+----+--------------------+-------------+-----------+------------+
|1/1/06 0:00|  3108 OCCIDENTAL DR|       3|3C        |1115|10851(A)VC TAKE V...|         2404|38.55042047|-121.3914158|
|1/1/06 0:00| 2082 EXPEDITION WAY|       5|5A        |1512|459 PC  BURGLARY ...|         2204|38.47350069|-121.4901858|
|1/1/06 0:00|          4 PALEN CT|       2|2A        | 212|10851(A)VC TAKE V...|         2404|38.65784584|-121.4621009|
|1/1/06 0:00|      22 BECKFORD CT|       6|6C        |1443|476 PC PASS FICTI...|         2501|38.50677377|-121.4269508|
|1/1/06 0:00|    3421 AUBURN BLVD|       2|2A        | 508|459 PC  BURGLARY-...|         2299| 38.6374478|-121.3846125|
|1/1/06 0:00|  5301 BONNIEMAE WAY|      

dfCrime: org.apache.spark.sql.package.DataFrame = [cdatetime: string, address: string ... 7 more fields]

To read plain text as a dataset, we need an extra <tt>import</tt> (<tt>sparkSession.implicits._</tt>), which supplies the encoder used to convert each row into a string. Once the text is read in, operations can be carried out to find line lengths, the total length of the text, or anything else you may want to do:

In [12]:
// Read a plain text file
import sparkSession.implicits._

// .as[String] converts the DataFrame returned by read.text into a typed Dataset[String]
val bookDS = sparkSession.read.text("files/TaleOfTwoCities.txt").as[String]
bookDS.show()

val lineLengths = bookDS.map(s => s.length)

// To maintain lineLengths in memory
//lineLengths.persist()

val totalLength = lineLengths.reduce((a, b) => a + b)
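// println on a Dataset prints only its schema summary, not its contents;
// the summed line lengths are held in totalLength
println(lineLengths)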




+--------------------+
|               value|
+--------------------+
|It was the best o...|
|                    |
|There were a king...|
|                    |
|It was the year o...|
|                    |
|France, less favo...|
|                    |
|In England, there...|
|                    |
|All these things,...|
+--------------------+

[value: int]


import sparkSession.implicits._
bookDS: org.apache.spark.sql.Dataset[String] = [value: string]
lineLengths: org.apache.spark.sql.Dataset[Int] = [value: int]
totalLength: Int = 5773
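The same steps can be tried without the book file. The sketch below writes three lines to a temporary file (a hypothetical stand-in for <tt>files/TaleOfTwoCities.txt</tt>), reads it back as a <tt>Dataset[String]</tt> and prints the actual line lengths and their total. It assumes Spark resolves the path on the local file system.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// `spark` is a hypothetical name; getOrCreate() reuses a session that is already running
val spark = SparkSession.builder
    .master("local[1]")
    .appName("read text sketch")
    .getOrCreate()
import spark.implicits._

// Write a small sample file: two lines of text separated by an empty line
val samplePath = Files.createTempFile("sample", ".txt")
Files.write(samplePath, "first line\n\nthird line here\n".getBytes(StandardCharsets.UTF_8))

// Each line of the file becomes one element of the Dataset
val sampleLines = spark.read.text(samplePath.toString).as[String]

// Transformation: the length of each line
val sampleLengths = sampleLines.map(s => s.length)

// Actions: bring the individual lengths and their sum back to the driver
val lengthValues = sampleLengths.collect()
val sampleTotal = sampleLengths.reduce((a, b) => a + b)

println(s"lengths=${lengthValues.mkString(",")}, total=$sampleTotal")
```

Calling <tt>collect()</tt> (or <tt>show()</tt>) is what makes the values visible; printing the Dataset itself only shows its schema.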

<h3>Transformations</h3>

We create other datasets from an existing dataset using **transformations**. A list of some of the possible transformations is available [here](http://spark.apache.org/docs/latest/programming-guide.html#transformations), and some examples follow:

In [13]:
val words = bookDS.flatMap(value => value.split("\\s+"))
words.show()
val groupedWords = words.groupByKey(_.toLowerCase)

+-------+
|  value|
+-------+
|     It|
|    was|
|    the|
|   best|
|     of|
| times,|
|     it|
|    was|
|    the|
|  worst|
|     of|
| times,|
|     it|
|    was|
|    the|
|    age|
|     of|
|wisdom,|
|     it|
|    was|
+-------+
only showing top 20 rows



words: org.apache.spark.sql.Dataset[String] = [value: string]
groupedWords: org.apache.spark.sql.KeyValueGroupedDataset[String, String] = org.apache.spark.sql.KeyValueGroupedDataset@c0b4d34

<h3>Actions</h3>

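Some of the most common actions are listed on [this page](http://spark.apache.org/docs/latest/programming-guide.html#actions). For example, <tt>count</tt> on a dataset returns its number of elements. In the cell below, <tt>count</tt> is called on the grouped words instead, which produces a new dataset of per-word counts; <tt>show</tt> is the action that triggers the computation and prints the first rows.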

In [14]:
val counts = groupedWords.count()
counts.show()

+-----------+--------+
|      value|count(1)|
+-----------+--------+
|       some|       3|
|      those|       2|
|   received|       1|
|     taking|       1|
|     worked|       1|
|      lords|       2|
|  countries|       1|
|   spending|       1|
|  character|       1|
|    snipped|       1|
|      dozen|       1|
|   chickens|       1|
|      among|       2|
|       even|       1|
|rest--along|       1|
|  cautioned|       1|
|        got|       1|
|        did|       1|
|   conceded|       1|
|        two|       2|
+-----------+--------+
only showing top 20 rows



counts: org.apache.spark.sql.Dataset[(String, Long)] = [value: string, count(1): bigint]
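To make the two meanings of <tt>count</tt> concrete, the sketch below applies both to a two-sentence sample instead of the book. The names <tt>spark</tt>, <tt>sample</tt>, <tt>sampleWords</tt> and <tt>perWord</tt> are assumptions made for this example.

```scala
import org.apache.spark.sql.SparkSession

// `spark` is a hypothetical name; getOrCreate() reuses a session that is already running
val spark = SparkSession.builder
    .master("local[1]")
    .appName("actions sketch")
    .getOrCreate()
import spark.implicits._

val sample = Seq("It was the best of times", "it was the worst of times").toDS()
val sampleWords = sample.flatMap(value => value.split("\\s+"))

// Dataset.count() is an action: it returns a number to the driver
val totalWords: Long = sampleWords.count()

// count() on a grouped dataset returns another Dataset of (key, count) pairs
val perWord = sampleWords.groupByKey(_.toLowerCase).count()

// collect() is the action that brings those pairs back to the driver
val wordCounts: Map[String, Long] = perWord.collect().toMap

println(s"total=$totalWords, it=${wordCounts("it")}, best=${wordCounts("best")}")
```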

<h2>Pipelines</h2>

It is common to run a sequence of algorithms over the same data. MLlib allows such a sequence to be encoded as a [pipeline](http://spark.apache.org/docs/latest/ml-pipeline.html), which takes care of passing the output of each stage to the next one.

We demonstrate a simple pipeline using the task of stop word removal.

In [15]:
// Prepare dataset consisting of (id, text) tuples.
val dataSet = sparkSession.createDataFrame(Seq(
  (0, "I saw the red baloon"),
  (1, "Mary had a little lamb")
)).toDF("id", "text")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val wordsData = tokenizer.transform(dataSet)
wordsData.select("words").show()

+--------------------+
|               words|
+--------------------+
|[i, saw, the, red...|
|[mary, had, a, li...|
+--------------------+



dataSet: org.apache.spark.sql.package.DataFrame = [id: int, text: string]
tokenizer: Tokenizer = tok_bd1ce41acb00
wordsData: org.apache.spark.sql.package.DataFrame = [id: int, text: string ... 1 more field]

As you will have noticed in the previous notebook's exercises, the most common words in a text are often words such as *and*, *so*, etc. These stop words carry little information and can be removed. Our pipeline has two stages:

1. tokenizer
2. stop word removal

These two stages are to be run in that order, and the input DataFrame will be transformed as it passes through them. Both stages are Transformer stages, and so the <tt>transform()</tt> method will be called on the DataFrame.

In [16]:
// Configure an ML pipeline, which consists of two stages: tokenizer, and stopWordsRemover.

val tokenizer = new Tokenizer()
    .setInputCol("text")
    .setOutputCol("words")

val remover = new StopWordsRemover()
    .setInputCol("words")
    .setOutputCol("filtered")

val pipeline = new Pipeline()
  .setStages(Array(tokenizer,remover))

val model = pipeline.fit(dataSet)
val result = model.transform(dataSet)
result.show()

+---+--------------------+--------------------+--------------------+
| id|                text|               words|            filtered|
+---+--------------------+--------------------+--------------------+
|  0|I saw the red baloon|[i, saw, the, red...|  [saw, red, baloon]|
|  1|Mary had a little...|[mary, had, a, li...|[mary, little, lamb]|
+---+--------------------+--------------------+--------------------+



tokenizer: Tokenizer = tok_43963b0eea05
remover: StopWordsRemover = stopWords_a619aebb3ebe
pipeline: Pipeline = pipeline_3af25d339b14
model: PipelineModel = pipeline_3af25d339b14
result: org.apache.spark.sql.package.DataFrame = [id: int, text: string ... 2 more fields]
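Since both stages are Transformers, the pipeline above should be equivalent to calling <tt>transform()</tt> on each stage by hand, in order. The sketch below checks this on a small sample; the names <tt>spark</tt>, <tt>docs</tt>, <tt>tok</tt>, <tt>rem</tt>, <tt>manual</tt> and <tt>piped</tt> are assumptions made for this example.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StopWordsRemover, Tokenizer}
import org.apache.spark.sql.SparkSession

// `spark` is a hypothetical name; getOrCreate() reuses a session that is already running
val spark = SparkSession.builder
    .master("local[1]")
    .appName("pipeline equivalence sketch")
    .getOrCreate()

val docs = spark.createDataFrame(Seq(
    (0, "I saw the red balloon"),
    (1, "Mary had a little lamb")
)).toDF("id", "text")

val tok = new Tokenizer().setInputCol("text").setOutputCol("words")
val rem = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")

// Calling transform() stage by stage, in order ...
val manual = rem.transform(tok.transform(docs))

// ... should give the same rows as letting a Pipeline run both stages
val piped = new Pipeline().setStages(Array(tok, rem)).fit(docs).transform(docs)

val sameRows = manual.collect().toSeq == piped.collect().toSeq
println(s"manual chaining and pipeline agree: $sameRows")
```

The pipeline form pays off once there are more stages, or when an Estimator is involved and the whole chain has to be fitted together.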

To see the full power of pipelines, we present a second example: one which includes an estimator in the form of logistic regression. This pipeline has three steps:

1. split each document's text into words (<i>tokenizer</i>)
2. convert each document's words into a feature vector (<i>hashingTF</i>)
3. learn a prediction model using the feature vectors and labels (<i>logistic regression</i>)

For Estimator stages, the <tt>fit()</tt> method is called to produce a Transformer and that Transformer’s <tt>transform()</tt> method is called on the DataFrame.

In [17]:
// Prepare training documents from a list of (id, text, label) tuples.
val training = sparkSession.createDataFrame(Seq(
    (0L, "a b c d e spark", 1.0),
    (1L, "b d", 0.0),
    (2L, "spark f g h", 1.0),
    (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
    .setInputCol("text")
    .setOutputCol("words")

val hashingTF = new HashingTF()
    .setNumFeatures(1000)
    .setInputCol(tokenizer.getOutputCol)
    .setOutputCol("features")

val lr = new LogisticRegression()
    .setMaxIter(10)
    .setRegParam(0.01)

val pipeline = new Pipeline()
    .setStages(Array(tokenizer, hashingTF, lr))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

// Prepare test documents, which are unlabeled (id, text) tuples.
val test = sparkSession.createDataFrame(Seq(
    (4L, "spark i j k"),
    (5L, "l m n"),
    (6L, "mapreduce spark"),
    (7L, "apache hadoop")
)).toDF("id", "text")

// Make predictions on test documents.
model.transform(test)
    .select("id", "text", "probability", "prediction")
    .collect()
    .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
        println(s"($id, $text) --> prob=$prob, prediction=$prediction")
    }

(4, spark i j k) --> prob=[0.5406433544852302,0.45935664551476996], prediction=0.0
(5, l m n) --> prob=[0.9334382627383524,0.06656173726164764], prediction=0.0
(6, mapreduce spark) --> prob=[0.7799076868204318,0.22009231317956823], prediction=0.0
(7, apache hadoop) --> prob=[0.9768636139518375,0.023136386048162483], prediction=0.0


training: org.apache.spark.sql.package.DataFrame = [id: bigint, text: string ... 1 more field]
tokenizer: Tokenizer = tok_f559905b18b2
hashingTF: HashingTF = hashingTF_53f2091000d0
lr: LogisticRegression = logreg_3df5de65390e
pipeline: Pipeline = pipeline_770b677eee63
model: PipelineModel = pipeline_770b677eee63
test: org.apache.spark.sql.package.DataFrame = [id: bigint, text: string]
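Two follow-ups to the example above can be sketched briefly: inspecting the fitted stages, where the logistic regression Estimator has been replaced by a fitted model, and the persistence feature listed in the introduction. The names <tt>spark</tt>, <tt>train</tt>, <tt>fitted</tt>, <tt>modelDir</tt> and <tt>reloaded</tt> are assumptions made for this example, and the temporary directory is assumed to be on the local file system.

```scala
import java.nio.file.Files
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

// `spark` is a hypothetical name; getOrCreate() reuses a session that is already running
val spark = SparkSession.builder
    .master("local[1]")
    .appName("pipeline persistence sketch")
    .getOrCreate()

val train = spark.createDataFrame(Seq(
    (0L, "a b c d e spark", 1.0),
    (1L, "b d", 0.0),
    (2L, "spark f g h", 1.0),
    (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

val tok = new Tokenizer().setInputCol("text").setOutputCol("words")
val tf = new HashingTF().setNumFeatures(1000).setInputCol(tok.getOutputCol).setOutputCol("features")
val logReg = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

val fitted = new Pipeline().setStages(Array(tok, tf, logReg)).fit(train)

// After fit(), the Estimator stage has become a fitted model, which is a Transformer
val lastIsModel = fitted.stages.last.isInstanceOf[LogisticRegressionModel]

// Save the fitted pipeline to a temporary directory and load it back
val modelDir = Files.createTempDirectory("pipeline-model").toString
fitted.write.overwrite().save(modelDir)
val reloaded = PipelineModel.load(modelDir)

// The reloaded pipeline should reproduce the original predictions
val before = fitted.transform(train).select("prediction").collect().map(_.getDouble(0)).toSeq
val after = reloaded.transform(train).select("prediction").collect().map(_.getDouble(0)).toSeq
println(s"last stage is a fitted model: $lastIsModel, predictions match: ${before == after}")
```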

<h2>Exercises</h2>

<h3>Exercise 1</h3>

In the CSV file above, <tt>[SacramentoCrime](http://samplecsvs.s3.amazonaws.com/SacramentocrimeJanuary2006.csv)</tt>, the <tt>ucr_ncic_code</tt> represents the type of crime carried out. Use any transformations / actions to output crime types in descending order of frequency. You should create this as a standalone program.

<h3>Exercise 2</h3>

As well as the "[TaleOfTwoCities.txt](files/TaleOfTwoCities.txt)", the files directory contains the file "[GreatExpectations.txt](files/GreatExpectations.txt)". Read in both files, and find the top 20 most frequent (overall) words that appear in both documents. (You will need to convert the documents to lower case, but you can assume that ends of line and whitespace indicate word boundaries.)

<h3>Exercise 3</h3>

There are a [lot of transformers and estimators](http://spark.apache.org/docs/latest/ml-features.html) implemented within Spark that can be pipelined. Create a pipeline which prints n-grams from the [TaleOfTwoCities.txt](files/TaleOfTwoCities.txt) and the [GreatExpectations.txt](files/GreatExpectations.txt) files.

In [18]:
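// Exercise 1

import org.apache.spark.sql.SparkSession

// Uses the notebook's existing sparkSession for the $-notation; in a standalone
// program, import the implicits inside main, after the session has been created.
import sparkSession.implicits._

object crimes_descending_freq {
    def main(args: Array[String]): Unit = {

        val sparkSession = SparkSession.builder
            .master("local")
            .appName("Crimes")
            .getOrCreate()

        val dfCrime = sparkSession.read.option("header", "true").csv("files/SacramentocrimeJanuary2006.csv")
        dfCrime.show()

        // Count the rows for each crime code and list the most frequent codes first
        dfCrime.groupBy("ucr_ncic_code").count().orderBy($"count".desc).show()
    }
}

crimes_descending_freq.main(Array())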

+-----------+--------------------+--------+----------+----+--------------------+-------------+-----------+------------+
|  cdatetime|             address|district|      beat|grid|          crimedescr|ucr_ncic_code|   latitude|   longitude|
+-----------+--------------------+--------+----------+----+--------------------+-------------+-----------+------------+
|1/1/06 0:00|  3108 OCCIDENTAL DR|       3|3C        |1115|10851(A)VC TAKE V...|         2404|38.55042047|-121.3914158|
|1/1/06 0:00| 2082 EXPEDITION WAY|       5|5A        |1512|459 PC  BURGLARY ...|         2204|38.47350069|-121.4901858|
|1/1/06 0:00|          4 PALEN CT|       2|2A        | 212|10851(A)VC TAKE V...|         2404|38.65784584|-121.4621009|
|1/1/06 0:00|      22 BECKFORD CT|       6|6C        |1443|476 PC PASS FICTI...|         2501|38.50677377|-121.4269508|
|1/1/06 0:00|    3421 AUBURN BLVD|       2|2A        | 508|459 PC  BURGLARY-...|         2299| 38.6374478|-121.3846125|
|1/1/06 0:00|  5301 BONNIEMAE WAY|      

import org.apache.spark.sql.SparkSession
import sparkSession.implicits._
defined object crimes_descending_freq

In [19]:
// Exercise 2

import org.apache.spark.sql.SparkSession

import sparkSession.implicits._


object frequent_20 {
    def main(args: Array[String]): Unit = {
      
        val sparkSession = SparkSession.builder
          .master("local")
          .appName("top_20")
          .getOrCreate()
        
        val tale = sparkSession.read.text("files/TaleOfTwoCities.txt").as[String]
        val great = sparkSession.read.text("files/GreatExpectations.txt").as[String]
    
        val words_tale = tale.flatMap(value => value.split("\\s+"))
        val words_great = great.flatMap(value => value.split("\\s+"))

        val grouped_Words_tale = words_tale.groupByKey(_.toLowerCase)
        val grouped_Words_great = words_great.groupByKey(_.toLowerCase)
    
        val counts_tale = grouped_Words_tale.count()
        //REQUIRED TO VISUALISE COLUMN NAME: counts_tale.show()
        counts_tale.orderBy($"count(1)".desc).show()
        
        val counts_great = grouped_Words_great.count()
        //REQUIRED TO VISUALISE COLUMN NAME: counts_great.show()
        counts_great.orderBy($"count(1)".desc).show()
        
        
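        // An inner join on the word keeps only the words that occur in both documents
        val joined = counts_tale.join(counts_great, Seq("value"))
        
        // Rename the columns; "Counts_age" holds the Great Expectations counts
        val newcolumns = Seq("Words", "Counts_tale", "Counts_age")
        val dataframe = joined.toDF(newcolumns: _*)
        // Add the two counts to get the overall frequency; show() prints the top 20 rows by default
        val final_df = dataframe.select($"Words", $"Counts_tale" + $"Counts_age")
        final_df.orderBy($"(Counts_tale + Counts_age)".desc).show()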
    }
}

frequent_20.main(Array())

+-----+--------+
|value|count(1)|
+-----+--------+
|  the|      76|
|   of|      55|
|  and|      40|
|   in|      26|
|    a|      23|
|   to|      23|
|  was|      21|
|   it|      14|
| with|      14|
| that|      12|
| were|      11|
|   by|      11|
|  had|      10|
|   on|       9|
|  his|       9|
|   as|       6|
|   at|       6|
|  for|       6|
|  one|       5|
|     |       5|
+-----+--------+
only showing top 20 rows

+-----+--------+
|value|count(1)|
+-----+--------+
|  the|      88|
|  and|      85|
|    i|      47|
|    a|      47|
|     |      43|
|   to|      41|
|   of|      38|
| that|      30|
|   he|      27|
|   my|      24|
|   me|      22|
|  his|      21|
|   in|      21|
|  you|      21|
|  was|      21|
| with|      16|
|   as|      16|
|   on|      14|
| said|      14|
|   at|      14|
+-----+--------+
only showing top 20 rows

+-----+--------------------------+
|Words|(Counts_tale + Counts_age)|
+-----+--------------------------+
|  the|                    

import org.apache.spark.sql.SparkSession
import sparkSession.implicits._
defined object frequent_20

In [20]:
// Exercise 3

import org.apache.spark.sql.SparkSession

import sparkSession.implicits._

import org.apache.spark.ml.feature.NGram


object ngrams {
    def main(args: Array[String]): Unit = {
      
        val sparkSession = SparkSession.builder
          .master("local")
          .appName("ngrams")
          .getOrCreate()
        
        val tale = sparkSession.read.text("files/TaleOfTwoCities.txt").as[String]
        val great = sparkSession.read.text("files/GreatExpectations.txt").as[String]
    
        val tokenizer = new Tokenizer().setInputCol("value").setOutputCol("words")
        
        val ngram2 = new NGram().setN(2).setInputCol(tokenizer.getOutputCol).setOutputCol("ngrams2")
        
        val pipeline = new Pipeline().setStages(Array(tokenizer,ngram2))
    
        val model_tale = pipeline.fit(tale)
        val result_tale = model_tale.transform(tale)
        result_tale.show()
        
        val model_great = pipeline.fit(great)
        val result_great = model_great.transform(great)
        result_great.show()
    }
    
}

ngrams.main(Array())





+--------------------+--------------------+--------------------+
|               value|               words|             ngrams2|
+--------------------+--------------------+--------------------+
|It was the best o...|[it, was, the, be...|[it was, was the,...|
|                    |                  []|                  []|
|There were a king...|[there, were, a, ...|[there were, were...|
|                    |                  []|                  []|
|It was the year o...|[it, was, the, ye...|[it was, was the,...|
|                    |                  []|                  []|
|France, less favo...|[france,, less, f...|[france, less, le...|
|                    |                  []|                  []|
|In England, there...|[in, england,, th...|[in england,, eng...|
|                    |                  []|                  []|
|All these things,...|[all, these, thin...|[all these, these...|
+--------------------+--------------------+--------------------+

+--------------------+--

import org.apache.spark.sql.SparkSession
import sparkSession.implicits._
import org.apache.spark.ml.feature.NGram
defined object ngrams