# Building a recommender with BigData Republic
<img src="https://www.bigdatarepublic.nl/wp-content/uploads/2019/05/BDR_Logo_RGB_no_whitespace_small.jpg" alt="BigData Republic" style="width: 400px;"/>

This notebook is an exercise to get familiar with recommender systems. Building a recommender is a very common use case in the big data industry, so it is worth knowing for aspiring data scientists and data engineers alike. Throughout this notebook we show you how to process a real-world dataset with all its hurdles. Real-world data is often very different from the pre-cleaned data you find on, for example, Kaggle, and requires a lot more preprocessing. We have already done some of this preprocessing for you, but you will still need to do some yourself before you can start building your recommender. The notebook consists of three parts: preprocessing your data, building a recommender system using matrix factorization, and evaluating your results.


## The data
<img src="https://truckstar.nl/app/uploads/2000/01/randstad-logo-share.png" alt="Randstad" style="width: 200px;"/>

We have prepared a subset of a real-world dataset for you from _Randstad_, one of our clients. Randstad is the biggest Dutch employment agency and processes thousands of vacancies every month. They have agreed to share this data for educational purposes, as long as it is appropriately anonymized and subsetted. We therefore provide you with a dataset of vacancy/candidate combinations. Here, a combination can mean that a candidate clicked on a vacancy, started an application procedure, or applied to a vacancy. Per combination, we also provide some extra information, such as the function description and the physical distance between the company and the candidate. This might be useful to increase the accuracy of your model. Furthermore, we provide a table with _profile data_. This contains some extra information on the candidates, such as their desired hourly wage and the number of hours they want to work per week. This might also be useful to enrich your model later on.

Since this data is shared with you privately and only for this course, **we ask you not to redistribute it**.

## Getting the data:
First, we download the data from Amazon S3 storage and write it to a file. In total we're talking about approximately 300MB of data.

In [ ]:
import sys.process._
import java.net.URL
import java.io.File
import scala.language.postfixOps

val src = "https://s3-eu-west-1.amazonaws.com/bdr-college/"
val src2 = "https://bdr-college.s3-eu-west-1.amazonaws.com/"
val dst = "notebooks/"

if (!new java.io.File("notebooks/profile_data.csv").exists) {
  new URL(src + "profile_data.csv") #> new File(dst + "profile_data.csv") !!
}
if (!new java.io.File("notebooks/click_data_train.csv").exists) {
  new URL(src + "click_data_train.csv") #> new File(dst + "click_data_train.csv") !!
}
if (!new java.io.File("notebooks/click_data_val.csv").exists) {
  new URL(src + "click_data_val.csv") #> new File(dst + "click_data_val.csv") !!
}
if (!new java.io.File("notebooks/zipcode_distances.csv").exists) {
  new URL(src + "zipcode_distances.csv") #> new File(dst + "zipcode_distances.csv") !!
}
if (!new java.io.File("notebooks/vacancies_validation.csv").exists) {
  new URL(src + "vacancies_validation.csv") #> new File(dst + "vacancies_validation.csv") !!
}

if (!new java.io.File("notebooks/click_data_test.csv").exists) {
  new URL(src2 + "click_data_test.csv") #> new File(dst + "click_data_test.csv") !!
}
if (!new java.io.File("notebooks/vacancies_test.csv").exists) {
  new URL(src2 + "vacancies_test.csv") #> new File(dst + "vacancies_test.csv") !!
}

import sys.process._
import java.net.URL
import java.io.File
import scala.language.postfixOps
src: String = https://s3-eu-west-1.amazonaws.com/bdr-college/
src2: String = https://bdr-college.s3-eu-west-1.amazonaws.com/
dst: String = notebooks/
res159: Any = ()


Next, we read that data into Spark DataFrames:

In [ ]:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
   .builder()
   .appName("BDRAssignment")
   .getOrCreate()

var profiles = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("notebooks/profile_data.csv")
var clicks_train = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("notebooks/click_data_train.csv")
var clicks_val = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("notebooks/click_data_val.csv")
var zipcode_distances = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("notebooks/zipcode_distances.csv")
var vacancies_val = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("notebooks/vacancies_validation.csv")

// Note: the validation DataFrames are overwritten with the test files here,
// so everything below is evaluated on the test data
clicks_val = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("notebooks/click_data_test.csv")
vacancies_val = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("notebooks/vacancies_test.csv")

spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@34a4033e
profiles: org.apache.spark.sql.DataFrame = [_c0: int, candidate_number: int ... 5 more fields]
clicks_train: org.apache.spark.sql.DataFrame = [_c0: int, Unnamed: 0: int ... 10 more fields]
clicks_val: org.apache.spark.sql.DataFrame = [_c0: int, Unnamed: 0: int ... 10 more fields]
zipcode_distances: org.apache.spark.sql.DataFrame = [_c0: int, from: int ... 2 more fields]
vacancies_val: org.apache.spark.sql.DataFrame = [_c0: int, vacancy_number: int ... 4 more fields]
clicks_val: org.apache.spark.sql.DataFrame = [_c0: int, Unnamed: 0: int ... 10 more fields]
vacancies_val: org.apache.spark.sql.DataFrame = [_c0: int, vacancy_number: int ... 4 more fields]


Since we have both clicks and applications from candidates, we can view these as ratings and weight them so that applications count more heavily than clicks. This usually gives better results. The "ecom_action" column in the clicks dataset records the action the candidate took on that vacancy: 1 and 2 are page views, 3 and 5 mean the candidate started an application procedure, and 6 means the candidate finished an application. We use these values as weights for the model, but these initial values might be suboptimal, so experiment with different values. The values you choose as replacements directly determine the relative importance of a click versus an application.

### Describing the data:

In [ ]:
profiles.show(5, false)
profiles.describe()

+---+----------------+-----------------------+--------------+--------------+-------------------+-------+
|_c0|candidate_number|maximum_travel_distance|week_hours_min|week_hours_max|candidate_hour_wage|cand_pc|
+---+----------------+-----------------------+--------------+--------------+-------------------+-------+
|0  |1330691123      |50.0                   |16.0          |24.0          |16.0               |1357   |
|1  |1050390808      |50.0                   |24.0          |32.0          |13.5               |1353   |
|2  |450737787       |25.0                   |24.0          |40.0          |14.423076923076923 |8242   |
|3  |709909933       |30.0                   |32.0          |32.0          |15.865384615384615 |1313   |
|4  |1703401364      |5.0                    |40.0          |40.0          |11.234817813765183 |1353   |
+---+----------------+-----------------------+--------------+--------------+-------------------+-------+
only showing top 5 rows

res162: org.apache.spark.sql.D

In [ ]:
clicks_train.show(5, false)
clicks_train.describe()

+---+----------+------------+----------------+--------------+-------------------+-------------------------+-----------------+--------+----------+-----------+----------+
|_c0|Unnamed: 0|Unnamed: 0.1|candidate_number|vacancy_number|date_action        |function_name            |request_hour_wage|distance|week_hours|ecom_action|company_pc|
+---+----------+------------+----------------+--------------+-------------------+-------------------------+-----------------+--------+----------+-----------+----------+
|1  |1         |1           |425458424       |790077500     |2017-02-20 16:40:11|medewerker klantenservice|10.0             |1.0     |40.0      |6          |1062.0    |
|2  |2         |2           |617788096       |790077500     |2017-02-14 13:26:46|medewerker klantenservice|10.0             |3.0     |40.0      |6          |1062.0    |
|5  |5         |5           |245302391       |790077500     |2017-02-17 14:00:08|medewerker klantenservice|10.0             |4.0     |40.0      |2         

In [ ]:
clicks_val.show(5, false)
clicks_val.describe()

+---+----------+------------+----------------+--------------+-------------------+-------------------------+-----------------+--------+----------+-----------+----------+
|_c0|Unnamed: 0|Unnamed: 0.1|candidate_number|vacancy_number|date_action        |function_name            |request_hour_wage|distance|week_hours|ecom_action|company_pc|
+---+----------+------------+----------------+--------------+-------------------+-------------------------+-----------------+--------+----------+-----------+----------+
|9  |9         |10207       |94815691        |2140691097    |2018-01-18 12:51:36|transportplanner         |13.26            |7.0     |40.0      |6          |1046.0    |
|28 |28        |10346       |470491779       |103296810     |2018-01-15 11:33:15|logistiek medewerker     |10.08            |70.0    |36.0      |2          |5911.0    |
|49 |49        |10442       |209188629       |1029829076    |2018-01-15 07:12:06|productiemedewerker      |11.73            |31.0    |28.0      |2         

In [ ]:
zipcode_distances.show(5, false)
zipcode_distances.describe()

+--------+----+----+--------+
|_c0     |from|to  |distance|
+--------+----+----+--------+
|10111011|1011|1011|0.0     |
|10111052|1011|1052|4.0     |
|10111068|1011|1068|17.0    |
|10111091|1011|1091|2.0     |
|10111244|1011|1244|22.0    |
+--------+----+----+--------+
only showing top 5 rows

res168: org.apache.spark.sql.DataFrame = [summary: string, _c0: string ... 3 more fields]


In [ ]:
vacancies_val.show(5, false)
vacancies_val.describe()

+---+--------------+-------------------------------------+------------------+----------+----------+
|_c0|vacancy_number|function_name                        |request_hour_wage |week_hours|company_pc|
+---+--------------+-------------------------------------+------------------+----------+----------+
|0  |1128985       |management assistent                 |17.307692307692307|40.0      |1057.0    |
|1  |1388552       |postsorteerder                       |9.2               |12.0      |1066.0    |
|2  |1472783       |customer service medewerker logistiek|14.423076923076925|40.0      |5632.0    |
|3  |1961784       |catering medewerker a                |10.25             |25.0      |3454.0    |
|4  |2177183       |vrachtwagenchauffeur                 |13.92             |40.0      |6181.0    |
+---+--------------+-------------------------------------+------------------+----------+----------+
only showing top 5 rows

res170: org.apache.spark.sql.DataFrame = [summary: string, _c0: string ... 

### Assignment:
* Convert all 1s in the "ecom_action" column to 2
* Convert the 3s and 5s to 4s
* Try to come up with a data-driven reasoning for choosing the ratio between clicks and applications

In [ ]:
clicks_val.groupBy("ecom_action")
          .count()
          .show()

+-----------+-----+
|ecom_action|count|
+-----------+-----+
|          1|    2|
|          6| 5610|
|          3|   43|
|          5| 1248|
|          2| 9324|
+-----------+-----+
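
One possible data-driven starting point for the click/application ratio is to look at how rare each action is. The sketch below regroups the counts printed above into page views, started applications and finished applications, and derives inverse-frequency weights relative to a page view. The `bucket` helper and the inverse-frequency heuristic are illustrative assumptions, not part of the assignment. Rarity is only a rough proxy for how strongly an action signals interest.

```scala
// Counts per raw ecom_action code, copied from the groupBy output above
val counts: Map[Int, Long] = Map(1 -> 2L, 2 -> 9324L, 3 -> 43L, 5 -> 1248L, 6 -> 5610L)

// Hypothetical helper: merge raw codes into the three behaviour classes of the assignment
def bucket(code: Int): String = code match {
  case 1 | 2 => "view"
  case 3 | 5 => "started"
  case 6     => "applied"
  case _     => "other"
}

// Total number of events per behaviour class
val perBucket: Map[String, Long] =
  counts.groupBy { case (code, _) => bucket(code) }
        .map { case (name, group) => name -> group.values.sum }

// Inverse-frequency weights, scaled so that a page view has weight 1.0
val views = perBucket("view").toDouble
val weights: Map[String, Double] = perBucket.map { case (name, n) => name -> views / n }

weights.toSeq.sortBy(_._2).foreach { case (name, w) => println(f"$name%-8s $w%.2f") }
```

Note that this heuristic ranks started applications above finished ones simply because they are rarer, so treat the result as a starting point for experiments and not as a final choice.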



In [ ]:
import org.apache.spark.sql.functions._

// Hint: make a UDF that you can apply on a certain column in a dataframe
def ecomActionRemap(s: String): Option[Integer] = {
  // Parse the action code; anything that cannot be parsed becomes -1
  val i = try {
    s.toInt
  } catch {
    case _: Exception => -1
  }

  if (i == 1) {
    Some(2)      // page view
  } else if (i == 3 || i == 5) {
    Some(4)      // application started
  } else {
    Some(i)      // 2, 6 and -1 pass through unchanged
  }
}

val tRemap = udf((f: String) => ecomActionRemap(f))

import org.apache.spark.sql.functions._
ecomActionRemap: (s: String)Option[Integer]
tRemap: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(StringType)))


In [ ]:
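// Note: only clicks_val is remapped here; apply tRemap to clicks_train as well
// if the training ratings should use the same scale
clicks_val = clicks_val.withColumn("ecom_action", tRemap('ecom_action))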

clicks_val.groupBy("ecom_action")
          .count()
          .show()

+-----------+-----+
|ecom_action|count|
+-----------+-----+
|          6| 5610|
|          4| 1291|
|          2| 9326|
+-----------+-----+

clicks_val: org.apache.spark.sql.DataFrame = [_c0: int, Unnamed: 0: int ... 10 more fields]
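
As an alternative to a UDF, the same remapping can be written with Spark's built-in `when`/`otherwise` column expressions, which Spark can optimize more easily than an opaque function. The following is a minimal sketch on a toy DataFrame. The names `remapSpark` and `toy` are made up for this example, and the `local[1]` master is only used if no session exists yet (inside the notebook `getOrCreate()` returns the running session).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Hypothetical standalone session for the sketch
val remapSpark = SparkSession.builder().master("local[1]").appName("RemapSketch").getOrCreate()
import remapSpark.implicits._

// Toy column with one row per raw action code
val toy = Seq(1, 2, 3, 5, 6).toDF("ecom_action")

// 1 -> 2, 3 and 5 -> 4, everything else unchanged
val remapped = toy.withColumn("ecom_action",
  when(col("ecom_action") === 1, 2)
    .when(col("ecom_action").isin(3, 5), 4)
    .otherwise(col("ecom_action")))

val result = remapped.as[Int].collect().toSeq
```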


After modifying the "ecom_action" column we can now start training our recommender. For this we want to perform matrix factorization using the Alternating Least Squares (ALS) algorithm. In Spark, we can do this using the org.apache.spark.ml.recommendation package. Use this package to perform ALS and make sure you understand the parameters that you need. There are some important choices that you will need to make, such as whether you want to use explicit or implicit matrix factorization and which values you want to use for regularization and the latent matrix rank. You can check out all these settings and more in the documentation: https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html.

### Assignment:
* Perform ALS on the clicks_train dataset
* Check whether to use implicit or explicit matrix factorization
* Look into the possible hyperparameters of the ALS function
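
Clicks and applications are observed behavior, not ratings a candidate typed in, which is the situation the implicit-feedback variant of ALS is meant for. Below is a minimal sketch of configuring that variant on a hypothetical toy dataset. The toy rows, the session name `alsSpark` and the specific hyperparameter values are assumptions chosen for illustration and are not tuned recommendations.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

// Hypothetical standalone session; inside the notebook getOrCreate() reuses the running one
val alsSpark = SparkSession.builder().master("local[1]").appName("AlsSketch").getOrCreate()
import alsSpark.implicits._

// Hypothetical toy interactions: (candidate, vacancy, action weight)
val toyClicks = Seq(
  (1, 10, 2.0f), (1, 11, 6.0f),
  (2, 10, 4.0f), (2, 12, 2.0f),
  (3, 11, 6.0f), (3, 12, 4.0f)
).toDF("candidate_number", "vacancy_number", "ecom_action")

val implicitAls = new ALS()
  .setImplicitPrefs(true)        // treat ratings as confidence in an observed interaction
  .setRank(5)                    // number of latent factors
  .setMaxIter(5)
  .setRegParam(0.1)              // regularization strength
  .setAlpha(1.0)                 // confidence scaling, only used when implicitPrefs is true
  .setColdStartStrategy("drop")  // drop rows whose user or item was unseen during training
  .setSeed(42L)
  .setUserCol("candidate_number")
  .setItemCol("vacancy_number")
  .setRatingCol("ecom_action")

val implicitModel = implicitAls.fit(toyClicks)
val toyPredictions = implicitModel.transform(toyClicks)
```

With implicit preferences the predictions are preference scores and no longer reconstructed ratings, so an RMSE against the raw "ecom_action" values is harder to interpret than in the explicit setting.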

In [ ]:
import org.apache.spark.ml.recommendation.{ALS, ALSModel}

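// Hint: check out the documentation on ALS in Spark
// Note: implicitPrefs is left at its default (false), so this is explicit ALS and alpha has no effect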
val als = new ALS()
  .setMaxIter(6)
  .setRegParam(1)
  .setAlpha(1)
  .setUserCol("candidate_number")
  .setItemCol("vacancy_number")
  .setRatingCol("ecom_action")

val model = als.fit(clicks_train)
model.setColdStartStrategy("drop")

val predictions = model.transform(clicks_val)

predictions.show(5, false)
predictions.describe()

+-----+----------+------------+----------------+--------------+-------------------+--------------------+-----------------+--------+----------+-----------+----------+----------+
|_c0  |Unnamed: 0|Unnamed: 0.1|candidate_number|vacancy_number|date_action        |function_name       |request_hour_wage|distance|week_hours|ecom_action|company_pc|prediction|
+-----+----------+------------+----------------+--------------+-------------------+--------------------+-----------------+--------+----------+-----------+----------+----------+
|40905|40905     |604290      |435595514       |7560806       |2018-01-30 14:53:14|helpende            |14.0             |28.0    |20.0      |2          |3901.0    |1.7187321 |
|40900|40900     |604268      |428370701       |7560806       |2018-01-03 08:04:32|helpende            |14.0             |0.0     |20.0      |2          |3901.0    |1.7633929 |
|40896|40896     |604237      |494542895       |7560806       |2018-01-28 17:50:44|helpende            |14.0       

In [ ]:
import org.apache.spark.ml.evaluation.RegressionEvaluator

val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("ecom_action")
  .setPredictionCol("prediction")

val rmse = evaluator.evaluate(predictions)

import org.apache.spark.ml.evaluation.RegressionEvaluator
evaluator: org.apache.spark.ml.evaluation.RegressionEvaluator = regEval_02b2bd02c0cc
rmse: Double = 2.242446973836365


You have now built your first recommender! However, there is an important issue with the way we have set up this model. Think about its goal. We can recommend either candidates to vacancies or vice versa, but vacancies are volatile: they come and go and are not always open to applications.

To illustrate, say we have trained a recommender on data up until December 2017. A vacancy V_1 that was open in November 2017 might be a great fit for a candidate C_1 looking for a job in May 2018, but we don't want to recommend old vacancies that are already closed. We could add a postprocessing filter to the output of the recommender to catch these erroneous recommendations, but that does not fix the issue, because no results would be left: the recommender cannot recommend vacancies that were not present in the training set (i.e. no vacancy posted after December 2017 will ever be recommended).

Ergo, we need some way for the recommender to base its recommendations of NEW vacancies on the OLD vacancies. For this we need the context of a vacancy, so that we can compare similar vacancies instead of treating each vacancy as a separate entity. We could, for example, look at the vacancy text to find common keywords. To keep it simple, we have added the "function_name" to each candidate/vacancy pair. This function name describes the category of the vacancy and can be used to recommend job categories instead of specific vacancies. If we can recommend job categories (which can still be quite specific), we can later search the most recent vacancies in that category and recommend those. This way, we solve the problem of not having any data for recent vacancies.
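
The last step of that idea, going from a recommended job category back to concrete vacancies, could look roughly like the sketch below. It uses plain Scala collections and a made-up `Vacancy` record; the field names, the `open` flag and the sample rows are all assumptions for illustration, since the dataset in this notebook does not include posting dates or open/closed status.

```scala
import java.time.LocalDate

// Hypothetical vacancy record; field names are illustrative only
case class Vacancy(number: Int, functionName: String, postedOn: LocalDate, open: Boolean)

val vacancies = Seq(
  Vacancy(1, "medewerker klantenservice", LocalDate.of(2017, 11, 3), open = false),
  Vacancy(2, "medewerker klantenservice", LocalDate.of(2018, 5, 2), open = true),
  Vacancy(3, "medewerker klantenservice", LocalDate.of(2018, 4, 20), open = true),
  Vacancy(4, "transportplanner", LocalDate.of(2018, 5, 1), open = true)
)

// Given a recommended job category, return the n most recently posted open vacancies
def recentOpenVacancies(all: Seq[Vacancy], category: String, n: Int): Seq[Vacancy] =
  all.filter(v => v.open && v.functionName == category)
     .sortBy(_.postedOn.toEpochDay)(Ordering[Long].reverse)
     .take(n)

val suggestions = recentOpenVacancies(vacancies, "medewerker klantenservice", n = 2)
```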

### Assignment:
* Perform ALS on the function_name (item col) and candidate_number (user col)
* ALS requires its inputs to be integers. Since the function_name is a string, you first need to convert this to an integer index. We've already done this for you.
* There can be multiple rows for a single candidate/function_name pair. Make sure to aggregate the ratings (e.g. by summing the "ecom_action" column)

In [ ]:
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.IndexToString
import org.apache.spark.ml.recommendation.{ALS, ALSModel}

// We use a StringIndexer to convert strings to integers
val indexer = new StringIndexer()
  .setInputCol("function_name")
  .setOutputCol("function_index")
  .setHandleInvalid("skip")

// Fit the indexer
val stringIndexerModel = indexer.fit(clicks_train)

// By transforming the stringindexer we get a new column 'function_index'
val clicks_train_indexed = stringIndexerModel.transform(clicks_train)
val clicks_val_indexed = stringIndexerModel.transform(clicks_val)

// Hint: aggregate actions per candidate/function combination (e.g. group by the candidate number and function id and sum the rating per group)
val grouped_train = clicks_train_indexed.groupBy("candidate_number", "function_index").sum("ecom_action")
val grouped_val = clicks_val_indexed.groupBy("candidate_number", "function_index").sum("ecom_action")

import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.IndexToString
import org.apache.spark.ml.recommendation.{ALS, ALSModel}
indexer: org.apache.spark.ml.feature.StringIndexer = strIdx_8f8f05d38bb7
stringIndexerModel: org.apache.spark.ml.feature.StringIndexerModel = strIdx_8f8f05d38bb7
clicks_train_indexed: org.apache.spark.sql.DataFrame = [_c0: int, Unnamed: 0: int ... 11 more fields]
clicks_val_indexed: org.apache.spark.sql.DataFrame = [_c0: int, Unnamed: 0: int ... 11 more fields]
grouped_train: org.apache.spark.sql.DataFrame = [candidate_number: int, function_index: double ... 1 more field]
grouped_val: org.apache.spark.sql.DataFrame = [candidate_number: int, function_index: double ... 1 more field]


In [ ]:
// Hint: run ALS in the same manner as you did before
// Don't make the rank and the number of iterations too high, it can cause memory issues

val als = new ALS()
  .setMaxIter(6)
  .setRegParam(1)
  .setAlpha(1)
  .setUserCol("candidate_number")
  .setItemCol("function_index")
  .setRatingCol("sum(ecom_action)")

val model = als.fit(grouped_train)
model.setColdStartStrategy("drop")

val predictions = model.transform(grouped_val)

predictions.show(5, false)
predictions.describe()

+----------------+--------------+----------------+----------+
|candidate_number|function_index|sum(ecom_action)|prediction|
+----------------+--------------+----------------+----------+
|500727530       |148.0         |4               |8.060758  |
|619054112       |148.0         |2               |4.2655196 |
|429072811       |148.0         |10              |2.241808  |
|235344475       |148.0         |10              |2.4114375 |
|401992706       |148.0         |8               |4.430622  |
+----------------+--------------+----------------+----------+
only showing top 5 rows

als: org.apache.spark.ml.recommendation.ALS = als_77a7b80b52c4
model: org.apache.spark.ml.recommendation.ALSModel = als_77a7b80b52c4
predictions: org.apache.spark.sql.DataFrame = [candidate_number: int, function_index: double ... 2 more fields]
res181: org.apache.spark.sql.DataFrame = [summary: string, candidate_number: string ... 3 more fields]


## Sample predictions
Let's see if our recommender produces the expected results. We will look at a candidate's clicking behavior and compare it with the output of the recommender for that candidate.

In [ ]:
// Take the first candidate
val candidate = clicks_train.select("candidate_number").take(1)

// See the clicks of the first candidate in the training set
val click_behavior = clicks_train.filter(col("candidate_number") === candidate(0).getInt(0))

click_behavior.show(10, false)

+-----+----------+------------+----------------+--------------+-------------------+-----------------------------------+------------------+--------+----------+-----------+----------+
|_c0  |Unnamed: 0|Unnamed: 0.1|candidate_number|vacancy_number|date_action        |function_name                      |request_hour_wage |distance|week_hours|ecom_action|company_pc|
+-----+----------+------------+----------------+--------------+-------------------+-----------------------------------+------------------+--------+----------+-----------+----------+
|1    |1         |1           |425458424       |790077500     |2017-02-20 16:40:11|medewerker klantenservice          |10.0              |1.0     |40.0      |6          |1062.0    |
|5446 |5446      |5446        |425458424       |1207882682    |2017-06-13 04:12:47|callcentermedewerker inbound       |10.0              |6.0     |32.0      |2          |1014.0    |
|6087 |6087      |6087        |425458424       |1396860979    |2017-03-21 20:10:14|adminis

In [ ]:
// We need all the possible function ids in the training set
val function_ids = grouped_train.select("function_index")
                                .distinct

// Cross joining the function ids with the candidate gives us a dataframe that our model can use for prediction.
// For every candidate_number (in this case only the first one) and function_index, the model can calculate the predicted score
val candidate_functions = clicks_train.select("candidate_number")
                                      .limit(1)
                                      .crossJoin(function_ids)

// Predict the activations
val activations = model.transform(candidate_functions)

// Invert the stringindexed results for verification purposes
val converter = new IndexToString()
  .setInputCol("function_index")
  .setOutputCol("function_name")

// Convert the function indices to function names
val converted = converter.transform(activations)

// Show the top predicted functions
converted.orderBy(desc("prediction"))
         .show(10, false)

+----------------+--------------+----------+----------------------------+
|candidate_number|function_index|prediction|function_name               |
+----------------+--------------+----------+----------------------------+
|425458424       |0.0           |98.52368  |administratief medewerker   |
|425458424       |3.0           |21.819305 |medewerker klantenservice   |
|425458424       |934.0         |17.847637 |flexcoordinator             |
|425458424       |881.0         |16.993267 |fotograaf                   |
|425458424       |847.0         |15.753659 |geneeskundige               |
|425458424       |5.0           |15.4159565|callcentermedewerker        |
|425458424       |938.0         |14.906261 |plaatwerker (middelbaar)    |
|425458424       |985.0         |14.69123  |cursuscoördinator           |
|425458424       |6.0           |14.185012 |callcentermedewerker inbound|
|425458424       |811.0         |13.895238 |consultant ict              |
+----------------+--------------+-----

## Predict the validation set
Testing recommender systems is a difficult task. A/B testing is usually the best option for evaluating a model. However, we cannot do that within the timeframe of this exercise. To still get a general idea of the performance, we will predict a ranking for vacancies in the validation set and compare the real behavior of the candidates with our predictions.

First we extract all unique candidates and unique function IDs in the validation set.

In [ ]:
var candidate_numbers = clicks_val_indexed.select("candidate_number").distinct
val function_ids = grouped_val.select("function_index").distinct

candidate_numbers: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [candidate_number: int]
function_ids: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [function_index: double]


We want to predict a rating for every vacancy in the vacancies_val dataframe. To obtain the rating, we predict the job function score per candidate and refine it using hourly wage, hours per week and distance as features. Let's take a look at the vacancies_val dataframe.

In [ ]:
vacancies_val = stringIndexerModel.transform(vacancies_val)
vacancies_val.show(10, false)

+---+--------------+-------------------------------------+------------------+----------+----------+--------------+
|_c0|vacancy_number|function_name                        |request_hour_wage |week_hours|company_pc|function_index|
+---+--------------+-------------------------------------+------------------+----------+----------+--------------+
|0  |1128985       |management assistent                 |17.307692307692307|40.0      |1057.0    |20.0          |
|1  |1388552       |postsorteerder                       |9.2               |12.0      |1066.0    |97.0          |
|2  |1472783       |customer service medewerker logistiek|14.423076923076925|40.0      |5632.0    |34.0          |
|3  |1961784       |catering medewerker a                |10.25             |25.0      |3454.0    |139.0         |
|4  |2177183       |vrachtwagenchauffeur                 |13.92             |40.0      |6181.0    |12.0          |
|5  |2424674       |planner                              |13.269230769230768|40.

We are going to work with large dataframes, so we define a helper function takeTopGrouped that selects the n highest records per group based on a sort column.

In [ ]:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import org.apache.spark.sql.functions.{rank, desc}

// Keep the n highest rows (by sortColumn) within each value of groupColumn
def takeTopGrouped(df: DataFrame, n: Int, sortColumn: String, groupColumn: String): DataFrame = {
  val w = Window.partitionBy(groupColumn)
                .orderBy(desc(sortColumn))

  // row_number gives every row a unique 1-based position within its group
  df.withColumn("rank", row_number().over(w))
    .where($"rank" <= n)
}

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import org.apache.spark.sql.functions.{rank, desc}
takeTopGrouped: (df: org.apache.spark.sql.DataFrame, n: Int, sortColumn: String, groupColumn: String)org.apache.spark.sql.DataFrame
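
To make the behavior of the helper concrete, here is a plain Scala collections analogue of "top n per group with a 1-based rank". It is only a sketch on made-up data and does not use Spark; the function name `topNPerGroup` and the sample triples are assumptions for illustration.

```scala
// Hypothetical (candidate, function, score) triples
val scores = Seq(
  (1, "a", 0.9), (1, "b", 0.4), (1, "c", 0.7),
  (2, "a", 0.2), (2, "b", 0.8)
)

// Keep the n highest-scoring rows within each candidate and attach a 1-based rank
def topNPerGroup(rows: Seq[(Int, String, Double)], n: Int): Map[Int, Seq[(String, Double, Int)]] =
  rows.groupBy(_._1).map { case (candidate, group) =>
    val ranked = group.sortBy(-_._3).take(n).zipWithIndex
      .map { case ((_, fn, score), idx) => (fn, score, idx + 1) }
    candidate -> ranked
  }

val top2 = topNPerGroup(scores, n = 2)
```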


For every unique candidate in the validation set we create their function predictions by joining the function ids with the candidates. The model turns each of these pairs into an activation (a predicted score), and we keep the top 5 activations per candidate.

In [ ]:
// Create all the candidate-function pairs
val validation = candidate_numbers.crossJoin(function_ids)

// Predict the function ids score in the validation set for all users using the ALS model
var predictions = model.transform(validation)

// For each candidate, select the top 5 job functions
predictions = takeTopGrouped(predictions, n=5, sortColumn="prediction", groupColumn="candidate_number")

validation: org.apache.spark.sql.DataFrame = [candidate_number: int, function_index: double]
predictions: org.apache.spark.sql.DataFrame = [candidate_number: int, function_index: double ... 2 more fields]
predictions: org.apache.spark.sql.DataFrame = [candidate_number: int, function_index: double ... 2 more fields]


To reduce the number of vacancies we have to score per candidate, we filter the vacancies in the validation set down to the top predicted functions per candidate. Joining these vacancies with the predictions gives us new vacancy/candidate pairs.
One feature is still missing: the distance from the candidate to the job. By joining with the profile information we add the candidate zipcode to the vacancy/candidate pairs.
With both candidate and vacancy zipcodes in the dataframe, the distance is obtained by joining with the zipcode_distances table.

In [ ]:
// Add vacancies containing the correct function ids
val vacancies_filtered = vacancies_val.join(predictions, Seq("function_index"), "inner")

// We only need the candidate zipcode.
// Hint: you can try taking more columns (such as hourly wage) to incorporate additional features into the model.
val profiles_limited = profiles.select("candidate_number", "cand_pc")

// Add zipcode by joining on candidate number
val vacancies_zipcodes = vacancies_filtered.join(profiles_limited, Seq("candidate_number"), "left")

// Join on company postal code and candidate postal code
val vacancies_distance = vacancies_zipcodes.join(zipcode_distances,
                                                 vacancies_zipcodes("company_pc") === zipcode_distances("to")
                                                   && vacancies_zipcodes("cand_pc") === zipcode_distances("from"), "left")

vacancies_filtered: org.apache.spark.sql.DataFrame = [function_index: double, _c0: int ... 8 more fields]
profiles_limited: org.apache.spark.sql.DataFrame = [candidate_number: int, cand_pc: int]
vacancies_zipcodes: org.apache.spark.sql.DataFrame = [candidate_number: int, function_index: double ... 9 more fields]
vacancies_distance: org.apache.spark.sql.DataFrame = [candidate_number: int, function_index: double ... 13 more fields]
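
Because the distance is attached with a left join, candidate/vacancy pairs without a known zipcode distance are kept, with a null distance. The plain Scala sketch below mimics that behavior with a `Map` lookup, where `None` plays the role of null. The first three distances are copied from the zipcode_distances preview earlier in this notebook; the pair with zipcode 9999 is a made-up example of a missing entry.

```scala
// Distances keyed by (from, to) zipcode, copied from the zipcode_distances preview
val distances: Map[(Int, Int), Double] = Map(
  (1011, 1011) -> 0.0,
  (1011, 1052) -> 4.0,
  (1011, 1068) -> 17.0
)

// Hypothetical candidate/vacancy pairs as (candidate zipcode, company zipcode)
val pairs = Seq((1011, 1052), (1011, 9999))

// Like the left join: every pair is kept, a missing distance becomes None (null in Spark)
val withDistance: Seq[((Int, Int), Option[Double])] =
  pairs.map(p => p -> distances.get(p))
```

In Spark a null distance propagates through later arithmetic, so it is worth deciding whether such rows should be dropped or given a default distance.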


All features are now present in one large table. However, their value ranges differ a lot, so we need to normalize the distance, hourly wage and function prediction to values between 0 and 1.
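
Written out, this normalization amounts to dividing each feature by its per-candidate maximum. The symbols below are notation introduced here for illustration: $V_c$ is the set of shortlisted vacancies for candidate $c$, $d_{c,v}$ the distance, $w_v$ the requested hourly wage of vacancy $v$, and $p_{c,v}$ the predicted function score.

```latex
\tilde{d}_{c,v} = 1 - \frac{d_{c,v}}{\max_{v' \in V_c} d_{c,v'}}, \qquad
\tilde{w}_{c,v} = \frac{w_{v}}{\max_{v' \in V_c} w_{v'}}, \qquad
\tilde{p}_{c,v} = \frac{p_{c,v}}{\max_{v' \in V_c} p_{c,v'}}
```

The distance is inverted so that nearby vacancies get a value close to 1 and the farthest vacancy gets 0.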

In [ ]:
// First calculate the per-candidate maximum of each feature
val max_distances = vacancies_distance
  .groupBy("candidate_number")
  .agg(max($"distance").alias("max_distance"),
       max($"request_hour_wage").alias("max_request_hour_wage"),
       max($"prediction").alias("max_prediction"))

// Join the maxima into the vacancies table. Every candidate now has their own maximum per feature
var vacancies_max_joined = vacancies_distance.join(max_distances, Seq("candidate_number"), "inner")

vacancies_max_joined = vacancies_max_joined.withColumn("normalized_distance", lit(1) - col("distance").divide(col("max_distance")))
vacancies_max_joined = vacancies_max_joined.withColumn("normalized_prediction", col("prediction").divide(col("max_prediction")))
vacancies_max_joined = vacancies_max_joined.withColumn("normalized_request_hour_wage", col("request_hour_wage").divide(col("max_request_hour_wage")))

max_distances: org.apache.spark.sql.DataFrame = [candidate_number: int, max_distance: double ... 2 more fields]
vacancies_max_joined: org.apache.spark.sql.DataFrame = [candidate_number: int, function_index: double ... 19 more fields]
vacancies_max_joined: org.apache.spark.sql.DataFrame = [candidate_number: int, function_index: double ... 19 more fields]
vacancies_max_joined: org.apache.spark.sql.DataFrame = [candidate_number: int, function_index: double ... 19 more fields]
vacancies_max_joined: org.apache.spark.sql.DataFrame = [candidate_number: int, function_index: double ... 19 more fields]
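To make the normalization concrete, here is a minimal sketch in plain Scala collections (no Spark), applied to the distance feature only. The `Row` case class, the object name and the rule for a zero maximum are assumptions made for this illustration; they are not part of the notebook's pipeline.

```scala
// Plain-Scala illustration of per-candidate max normalization (no Spark needed).
// Row and its fields are hypothetical names made up for this example.
object NormalizationSketch {
  final case class Row(candidate: Int, vacancy: Int, distance: Double)

  // Returns each row with its normalized distance: 1.0 means "as close as possible",
  // 0.0 means "the farthest vacancy this candidate has in the table".
  def normalizeDistance(rows: Seq[Row]): Seq[(Row, Double)] = {
    // Maximum distance per candidate, the collection analogue of
    // groupBy("candidate_number").agg(max($"distance"))
    val maxPerCandidate: Map[Int, Double] =
      rows.groupBy(_.candidate).map { case (c, rs) => c -> rs.map(_.distance).max }

    rows.map { r =>
      val maxDistance = maxPerCandidate(r.candidate)
      // Assumption: when the maximum is 0, every vacancy is at distance 0, so we
      // return 1.0. A plain Double division would give NaN (0.0 / 0.0) instead.
      val normalized = if (maxDistance == 0.0) 1.0 else 1.0 - r.distance / maxDistance
      (r, normalized)
    }
  }
}
```

For a candidate with vacancies at distance 5 and 20, the sketch yields 0.75 and 0.0. The zero-maximum guard is worth keeping in mind for the Spark version as well, since the cell above does not handle that case explicitly.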


With the normalized features in place, we only have to combine them into a single score per vacancy. The hourly wage gets half the weight of the other two features, but you can experiment with these weights. For every candidate we select the 15 highest-scoring vacancies, which should appear on the first page when that candidate searches for vacancies on the Randstad site.

In [ ]:
// Sum our three features. Hint: you can try playing with the multipliers of the features to see
// what works better. You can even do a grid search to really optimize your multipliers!
val scored_df = vacancies_max_joined.withColumn("total_score",
  col("normalized_distance") + col("normalized_prediction") + lit(0.5) * col("normalized_request_hour_wage"))

// Take the top 15 per candidate
val top_n_scored_df = takeTopGrouped(scored_df, n=15, sortColumn="total_score", groupColumn="candidate_number")

scored_df: org.apache.spark.sql.DataFrame = [candidate_number: int, function_index: double ... 20 more fields]
top_n_scored_df: org.apache.spark.sql.DataFrame = [candidate_number: int, function_index: double ... 20 more fields]
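The scoring and top-n selection above can also be sketched on plain Scala collections. The sketch below only illustrates the idea of a weighted sum followed by a per-candidate top-n; it is not the implementation of `takeTopGrouped`, and the `Scored` case class and function names are hypothetical.

```scala
// Weighted sum of normalized features plus a per-candidate top-n, without Spark.
// All names here are hypothetical and chosen for this illustration.
object ScoringSketch {
  final case class Scored(candidate: Int, vacancy: Int,
                          normDistance: Double, normPrediction: Double, normWage: Double)

  // The default weights mirror the cell above: distance 1, prediction 1, wage 0.5.
  def totalScore(s: Scored, wDistance: Double = 1.0, wPrediction: Double = 1.0,
                 wWage: Double = 0.5): Double =
    wDistance * s.normDistance + wPrediction * s.normPrediction + wWage * s.normWage

  // Keep the n highest-scoring vacancies for every candidate, best first.
  def topNPerCandidate(rows: Seq[Scored], n: Int): Map[Int, Seq[Scored]] =
    rows.groupBy(_.candidate).map { case (c, rs) =>
      c -> rs.sortBy(r => -totalScore(r)).take(n)
    }
}
```

Passing the weights as parameters makes it easy to try other multipliers later without touching the ranking logic.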


## Evaluate the predictions
The last step is evaluating our predictions. We calculate the percentage of true applications that appeared in our predictions (recall).

In [ ]:
// Select applications only
clicks_val = clicks_val.drop("action")
val true_set = clicks_val.filter($"ecom_action" === 6)

// Join our predictions with the true set. An inner join keeps only the rows that appear in both sets.
val cross_set = true_set.join(top_n_scored_df, 
                              (top_n_scored_df("vacancy_number") === true_set("vacancy_number"))
                                && (top_n_scored_df("candidate_number") === true_set("candidate_number")),
                              "inner")

val len_true = true_set.count()
val len_cross = cross_set.count()
if (len_true > 0) {
  // Simply use the lengths of the dataframes for our final score
  val final_percentage =  100 * len_cross / len_true.toDouble
  println(final_percentage.toString + "% of the applications were in the prediction set")
}

4.937611408199643% of the applications were in the prediction set
clicks_val: org.apache.spark.sql.DataFrame = [_c0: int, Unnamed: 0: int ... 10 more fields]
true_set: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_c0: int, Unnamed: 0: int ... 10 more fields]
cross_set: org.apache.spark.sql.DataFrame = [_c0: int, Unnamed: 0: int ... 32 more fields]
len_true: Long = 5610
len_cross: Long = 277
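The recall computed above boils down to a set intersection. A minimal sketch in plain Scala follows, assuming each application and each recommendation is identified by a (candidate, vacancy) pair and that duplicates have been removed; the object and function names are hypothetical.

```scala
// Recall as a set intersection, without Spark. Names are hypothetical.
object RecallSketch {
  // A (candidate_number, vacancy_number) pair identifies one application
  // or one recommendation. Assumption: each pair occurs at most once.
  type Pair = (Int, Int)

  // Percentage of true applications that also appear among the recommendations.
  // Returns None when there are no true applications, so we never divide by zero.
  def recallPercentage(trueApplications: Set[Pair], recommended: Set[Pair]): Option[Double] =
    if (trueApplications.isEmpty) None
    else Some(100.0 * trueApplications.intersect(recommended).size / trueApplications.size)
}
```

With four true applications of which two were recommended, the function returns `Some(50.0)`. Note that the Spark cell counts rows rather than distinct pairs, so the two can differ if the data contains duplicate applications.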


## Compare with a baseline
The final percentage by itself does not say much: is it high or low? To get a better feeling for this, we compare it with a baseline score. In our case the baseline recommends the same 15 most popular vacancies to every candidate. This baseline is better than selecting vacancies at random and provides a bit of insight.

In [ ]:
val true_set = clicks_val.filter($"ecom_action" === 6)

// Select the 15 most applied vacancies
val popular_set = true_set.groupBy("vacancy_number")
                          .count()
                          .orderBy(desc("count"))
                          .limit(15)

// Sum the count
val pop_count = popular_set.agg(sum("count"))
                           .first
                           .getLong(0)

// The summed count equals the number of applications to the popular vacancies.
// Dividing by the total number of applications gives us the recall.
val final_percentage =  100 * pop_count / true_set.count().toDouble
println(final_percentage.toString + "% of the applications were in the prediction set\n")

2.9055258467023175% of the applications were in the prediction set

true_set: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_c0: int, Unnamed: 0: int ... 10 more fields]
popular_set: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [vacancy_number: int, count: bigint]
pop_count: Long = 163
final_percentage: Double = 2.9055258467023175
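The popularity baseline can be sketched in plain Scala as follows. The function names are hypothetical, and the option to derive popularity from a different set than the one being evaluated is an assumption that goes beyond the cell above; passing the same sequence twice reproduces the notebook's approach.

```scala
// Popularity baseline on plain collections. Names are hypothetical.
object PopularityBaselineSketch {
  // Each element of `applications` is the vacancy number of one application.
  // Returns the k most applied-to vacancies with their counts, most popular first.
  def topKVacancies(applications: Seq[Int], k: Int): Seq[(Int, Int)] =
    applications.groupBy(identity).map { case (v, xs) => v -> xs.size }.toSeq
      .sortBy { case (v, count) => (-count, v) } // ties broken by vacancy number
      .take(k)

  // Percentage of applications in `evaluation` that target one of the k vacancies
  // that are most popular in `popularitySource`.
  def baselineRecall(popularitySource: Seq[Int], evaluation: Seq[Int], k: Int): Double = {
    require(evaluation.nonEmpty, "evaluation set must not be empty")
    val popular = topKVacancies(popularitySource, k).map(_._1).toSet
    100.0 * evaluation.count(popular.contains) / evaluation.size
  }
}
```

Deriving the popular vacancies from earlier data instead of the evaluation set itself would likely give a more conservative baseline, because the baseline then cannot see the answers it is scored on.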


## Start finetuning now!
You now have a working recommender and a decent validation method. Try to improve performance by, for instance:
* Tuning the feature weight parameters
* Optimizing ALS
* Incorporating the profile data
* Adding week hours as a feature
* Filtering out vacancies using rules (e.g. week hours don't fit the profile)
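As a starting point for the first suggestion, here is a minimal grid-search sketch over the three feature weights, written in plain Scala collections. The case classes, function names and the grid values are all assumptions made for illustration; in the notebook you would recompute `total_score` with Spark for each weight setting instead.

```scala
// Exhaustive grid search over feature weights, scored by top-n recall.
// All names are hypothetical; the features are assumed to be normalized already.
object WeightGridSearchSketch {
  final case class Features(candidate: Int, vacancy: Int,
                            distance: Double, prediction: Double, wage: Double)
  final case class Weights(distance: Double, prediction: Double, wage: Double)

  def score(f: Features, w: Weights): Double =
    w.distance * f.distance + w.prediction * f.prediction + w.wage * f.wage

  // Recall (in percent) of the top-n recommendations per candidate for one weight setting.
  def recallAt(rows: Seq[Features], applied: Set[(Int, Int)], w: Weights, n: Int): Double = {
    val recommended: Set[(Int, Int)] = rows.groupBy(_.candidate).values
      .flatMap(_.sortBy(f => -score(f, w)).take(n))
      .map(f => (f.candidate, f.vacancy))
      .toSet
    if (applied.isEmpty) 0.0
    else 100.0 * applied.count(recommended.contains) / applied.size
  }

  // Try every combination of the given weight values and return the best setting
  // together with its recall. `grid` must not be empty.
  def gridSearch(rows: Seq[Features], applied: Set[(Int, Int)],
                 grid: Seq[Double], n: Int): (Weights, Double) = {
    val results = for {
      wd <- grid
      wp <- grid
      ww <- grid
    } yield {
      val w = Weights(wd, wp, ww)
      (w, recallAt(rows, applied, w, n))
    }
    results.maxBy(_._2)
  }
}
```

A call such as `WeightGridSearchSketch.gridSearch(rows, applied, Seq(0.0, 0.5, 1.0), 15)` evaluates 27 weight combinations. Keep in mind that weights tuned on the validation set will look better on that set than on new data, so holding out a separate test set is advisable.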