Spark ML pipelines work much like their scikit-learn counterparts.

Differences:
1. More verbose
2. Transformers accept a dataframe as input and transform it by adding more columns to the dataframe
3. Transformers have less useful defaults
    
The second point makes pipelines more flexible, because inputs and outputs don't have to be matrices: each stage reads the columns it needs and writes its result to a new column.
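The column-appending behaviour can be seen on a toy DataFrame. This is a minimal sketch, assuming a local Spark session; the data, the column names `text` and `words`, and the variable names are made up for illustration.

```scala
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.SparkSession

// Local session for experimentation only (assumption: not running on a cluster).
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("transformer-adds-column")
  .getOrCreate()
import spark.implicits._

// Hypothetical toy data: a single text column.
val df = Seq("Spark pipelines", "scikit-learn pipelines").toDF("text")

val demoTokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

// transform() returns a new DataFrame: "text" is kept and "words" is appended.
val withWords = demoTokenizer.transform(df)
withWords.printSchema()
```

The input DataFrame is not modified; `withWords` carries both the original column and the new one, so later stages can pick whichever they need.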

Scala example

Suppose the dataset has two text columns, `person_description` and `company_description`. Each one gets its own tokenizer and count vectorizer:

```scala
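import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer}

// Split person_description into lowercase tokens of two or more word characters.
val tokenizer = new RegexTokenizer()
  .setToLowercase(true)
  .setPattern("(?u)\\b\\w\\w+\\b") // scikit-learn's default token pattern
  .setGaps(false) // the pattern matches tokens, not the gaps between them
  .setInputCol("person_description")
  .setOutputCol("person_description_words")

// Count term occurrences, keeping only terms that appear in at least 5 documents.
val vectorizer = new CountVectorizer()
  .setMinDF(5)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("person_description_tf")

// Same two stages for company_description.
val tokenizer2 = new RegexTokenizer()
  .setToLowercase(true)
  .setPattern("(?u)\\b\\w\\w+\\b") // scikit-learn's default token pattern
  .setGaps(false)
  .setInputCol("company_description")
  .setOutputCol("company_description_words")

val vectorizer2 = new CountVectorizer()
  .setMinDF(5)
  .setInputCol(tokenizer2.getOutputCol)
  .setOutputCol("company_description_tf")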
```

Once these stages have been fitted and applied, the dataframe should have 6 columns:
- person_description (original one)
- person_description_words - list of strings (words)
- person_description_tf (tf - term frequency) - sparse vector of term counts
- company_description (original one)
- company_description_words - list of strings (words)
- company_description_tf - sparse vector of term counts
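To check the resulting columns, the text stages can be fitted on a few toy rows. This is a sketch under stated assumptions: `textStages` is a hypothetical helper (not part of Spark) that removes the duplication between the two columns, the rows are made up, and `minDF` is lowered to 1 only because the toy data is tiny.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("text-stages")
  .getOrCreate()
import spark.implicits._

// Hypothetical helper: tokenizer + count vectorizer for one text column.
def textStages(col: String, minDF: Double): Array[PipelineStage] = {
  val tok = new RegexTokenizer()
    .setToLowercase(true)
    .setPattern("(?u)\\b\\w\\w+\\b")
    .setGaps(false)
    .setInputCol(col)
    .setOutputCol(s"${col}_words")
  val cv = new CountVectorizer()
    .setMinDF(minDF)
    .setInputCol(tok.getOutputCol)
    .setOutputCol(s"${col}_tf")
  Array(tok, cv)
}

// Toy rows, made up for illustration.
val toy = Seq(
  ("data engineer at a bank", "retail bank"),
  ("software engineer", "software company")
).toDF("person_description", "company_description")

// minDF = 1 here only because the toy data has two rows.
val textPipeline = new Pipeline().setStages(
  textStages("person_description", 1.0) ++ textStages("company_description", 1.0))

// fit() learns the vocabularies; transform() appends the new columns.
val transformed = textPipeline.fit(toy).transform(toy)
transformed.printSchema()
```

The schema printed at the end lists the two original columns plus the four derived ones.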

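To assemble the final feature set, use the `VectorAssembler` class, which concatenates the listed columns into a single vector column. The same snippet also configures an XGBoost estimator: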

```scala
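import ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator // from the xgboost4j-spark package
import org.apache.spark.ml.feature.VectorAssembler

// Concatenate both term-count vectors into one "features" column.
val va = new VectorAssembler()
  .setInputCols(Array("person_description_tf", "company_description_tf"))
  .setOutputCol("features")

val numRound = 100 // number of boosting rounds
val numWorkers = 4 // number of parallel XGBoost workers
val paramMap = List(
  "eta" -> 0.1f,
  "max_depth" -> 6,
  "min_child_weight" -> 3.0,
  "subsample" -> 1.0,
  "colsample_bytree" -> 0.82,
  "colsample_bylevel" -> 0.9,
  "base_score" -> 0.005,
  "eval_metric" -> "auc",
  "seed" -> 49,
  "silent" -> 1,
  "objective" -> "binary:logistic").toMap

// Despite the name, this is an unfitted estimator; it is trained by calling fit.
val model = new XGBoostEstimator(xgboostParams = paramMap, round = numRound, nWorkers = numWorkers)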
```
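The snippets above define stages but do not chain them. Below is a minimal sketch of how an assembler and a classifier could be combined with `Pipeline`. As an assumption made for illustration, `LogisticRegression` stands in for the XGBoost estimator so the example runs without the xgboost4j-spark dependency, and the input is toy data that is already vectorized.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("assemble-and-fit")
  .getOrCreate()
import spark.implicits._

// Toy data (made up): two already-vectorized columns plus a binary label.
val vectorized = Seq(
  (Vectors.dense(1.0, 0.0), Vectors.dense(3.0), 1.0),
  (Vectors.dense(0.0, 2.0), Vectors.dense(0.0), 0.0),
  (Vectors.dense(2.0, 0.0), Vectors.dense(1.0), 1.0),
  (Vectors.dense(0.0, 1.0), Vectors.dense(0.0), 0.0)
).toDF("person_description_tf", "company_description_tf", "label")

val assembler = new VectorAssembler()
  .setInputCols(Array("person_description_tf", "company_description_tf"))
  .setOutputCol("features")

// Stand-in classifier (assumption): any estimator that reads "features"
// and "label" could take its place.
val classifier = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)

// Stages run in order: assemble the features, then train the classifier.
val fullPipeline = new Pipeline().setStages(Array(assembler, classifier))
val fitted = fullPipeline.fit(vectorized)
val scored = fitted.transform(vectorized)

// First assembled row: the two input vectors concatenated in order.
val firstFeatures = scored.select("features").head.getAs[Vector](0)
println(firstFeatures)
```

In a full pipeline the tokenizer and vectorizer stages would come first in the `setStages` array, so that one `fit` call on the raw dataframe trains everything end to end.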

Exercise
-------------------

Use the data from the previous exercise (company industry classification) to build a pipeline in Spark.