### Date Predictions Model - Random Forest
This model predicts whether a user who has already converted at least once will book a ticket within a given time window. It takes into account features such as origin, destination, travelClass, LOS, DBD, pax, and the value of the deal.

The model learns from a random set of users who have each booked at least one ticket in the past, and it outputs whether or not a user will book a ticket in a given time window.
The idea is for the client to know whether to include specific users in different campaigns or to exclude them for specific periods and regions.

This model learns how specific origins and destinations around the globe correlate with booking dates.

In [None]:
%%cleanup -f

In [None]:
%%configure -f
{"driverMemory": "48G", "executorMemory": "8G", "executorCores": 2, "numExecutors": 50}

In [None]:
%%info

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("date_predictions_model") \
    .config("spark.driver.maxResultSize", "10g") \
    .getOrCreate()

In [None]:
spark.sql("set spark.sql.caseSensitive=true")

In [None]:
from pyspark.sql import SQLContext
sqlC = SQLContext(sc)

`directory` is the s3 bucket.

`model_dir` is the model location in the s3 bucket.

`featurized_pipeline_name` is the location from which the featurized data is read.

`predictions_pipeline_name` is where the predictions will be written to.

In [None]:
# Fill in these values before running the rest of the notebook.
directory = ''
model_dir = ''
featurized_pipeline_name = ''
predictions_pipeline_name = ''
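
Every later cell depends on these four settings, so it can help to fail fast when one of them is left empty. The following is a minimal sketch, assuming the settings are plain strings; `build_s3_path` and the example values are hypothetical and not part of the original notebook:

```python
def build_s3_path(*parts):
    """Join path components with '/', rejecting blank ones."""
    cleaned = []
    for part in parts:
        if not isinstance(part, str) or not part.strip():
            raise ValueError("empty path component: %r" % (part,))
        # Strip stray slashes so components join with exactly one '/'.
        cleaned.append(part.strip().strip('/'))
    return '/'.join(cleaned)


# Hypothetical example values, for illustration only.
example_path = build_s3_path('s3://example-bucket', 'models/', 'featurized')
print(example_path)  # s3://example-bucket/models/featurized
```

With a helper like this, an unset `directory` or `model_dir` would raise a `ValueError` before any Spark job is started.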

#### read featurized dataframe from s3

In [None]:
file_loc = ('%s/%s/%s/*' % (directory, model_dir, featurized_pipeline_name))
ddf = spark.read.parquet(file_loc)

#### select relevant columns only

In [None]:
from pyspark.ml.feature import SQLTransformer
selector_transformer = SQLTransformer(statement='SELECT user_id, features, label FROM __THIS__')

#### shuffle dataframe

In [None]:
shuffle_dataframe = SQLTransformer(statement='SELECT * FROM __THIS__ ORDER BY RAND()')

#### add column with random values for splitting dataframe

In [None]:
rand = SQLTransformer(statement='SELECT *, RAND() AS random_var FROM __THIS__')

#### split dataframe to train/test sets (80/20)

In [None]:
split_data = SQLTransformer(statement='SELECT *, CASE WHEN random_var < 0.8 THEN "train" ELSE "test" END AS train_test FROM __THIS__')

In [None]:
from pyspark.ml.pipeline import Pipeline
pipeline = Pipeline(stages=[selector_transformer, shuffle_dataframe, rand, split_data])

In [None]:
fitted_model = pipeline.fit(ddf)

In [None]:
bookings_ddf = fitted_model.transform(ddf)
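
As a point of comparison, an 80/20 split can also be sketched with `DataFrame.randomSplit`, which takes a seed and so makes the split easier to reproduce. This is an alternative technique, not the notebook's SQLTransformer approach; the helper name `split_train_test` and the seed value are assumptions made for illustration:

```python
def split_train_test(df, train_fraction=0.8, seed=42):
    """Return (train_df, test_df) using DataFrame.randomSplit."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    # randomSplit draws each row independently, so the resulting
    # sizes are approximately, not exactly, 80/20.
    weights = [train_fraction, 1.0 - train_fraction]
    train_df, test_df = df.randomSplit(weights, seed=seed)
    return train_df, test_df


# Hypothetical usage with the notebook's dataframe:
# train_df, test_df = split_train_test(ddf.select('user_id', 'features', 'label'))
```

The sketch only replaces the splitting step; the columns selected and the downstream training code would stay the same.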

#### select rows for the training set - 80% of the dataframe

In [None]:
train_transformer = SQLTransformer(statement='SELECT user_id, features, label FROM __THIS__ WHERE train_test == "train"')

In [None]:
train_pipeline = Pipeline(stages=[train_transformer])

In [None]:
train_model = train_pipeline.fit(bookings_ddf)

In [None]:
train_booking_date_ddf = train_model.transform(bookings_ddf)

#### select rows for the testing set - 20% of the dataframe

In [None]:
test_transformer = SQLTransformer(statement='SELECT user_id, features, label FROM __THIS__ WHERE train_test == "test"')

In [None]:
from pyspark.ml.pipeline import Pipeline
test_pipeline = Pipeline(stages=[test_transformer])

In [None]:
test_model = test_pipeline.fit(bookings_ddf)

In [None]:
test_booking_date_ddf = test_model.transform(bookings_ddf)

#### Random Forest Classifier
Explanation of the different hyperparameters:

`numTrees` is set to 50 because it is the smallest number of trees at which the model's performance stays high (more trees can improve performance, but cost more computation time).

`maxBins` is related to the number of values in the categorical features. It has to be at least as large as the number of values in each categorical feature (OneHotEncoded `origin`, OneHotEncoded `booking_type`).

`maxDepth` is the maximum depth of each tree in the forest. Deeper trees can fit the training data more closely, which can raise accuracy but risks overfitting the model to the training set examples.

`minInfoGain` is the minimum information gain required for a split to be considered at a tree node. The higher the value, the greater the risk of underfitting the model.
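
To make the `minInfoGain` threshold concrete, the sketch below computes the information gain of a candidate split using Gini impurity, which is the default impurity measure of Spark's `RandomForestClassifier`. It is a simplified pure-Python illustration, not Spark's implementation; the function names and the toy labels are assumptions:

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


parent = [0, 0, 1, 1]
print(information_gain(parent, [0, 0], [1, 1]))  # perfect split: 0.5
print(information_gain(parent, [0, 1], [0, 1]))  # uninformative split: 0.0
```

A split is only kept when its gain reaches the threshold, so raising `minInfoGain` above 0.0 prunes weak splits such as the second one and yields shallower trees.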

In [None]:
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(labelCol="label", featuresCol="features", probabilityCol='conversion_probability', numTrees=50, maxBins=1250, maxDepth=30, minInfoGain=0.0)

In [None]:
rf_model = rf.fit(train_booking_date_ddf)

#### the feature importances of the model

In [None]:
rf_model.featureImportances
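
The importances come back as a vector indexed by feature slot, which is hard to read on its own. A minimal sketch for ranking them follows, assuming a list of slot names in the same order as the assembled feature vector is available; `rank_importances` and the example values are hypothetical, and one-hot encoded features occupy several slots each:

```python
def rank_importances(importances, feature_names):
    """Pair each importance with its name, sorted from most to least important."""
    # Spark ML vectors expose toArray(); plain sequences are used as-is.
    values = importances.toArray() if hasattr(importances, 'toArray') else importances
    values = [float(v) for v in values]
    if len(values) != len(feature_names):
        raise ValueError("expected one name per feature slot")
    return sorted(zip(feature_names, values), key=lambda pair: pair[1], reverse=True)


# Hypothetical importances and slot names, for illustration only.
print(rank_importances([0.1, 0.6, 0.3], ['LOS', 'DBD', 'pax']))
# [('DBD', 0.6), ('pax', 0.3), ('LOS', 0.1)]
```

In the notebook this would be called as `rank_importances(rf_model.featureImportances, names)`, where `names` is the assumed list of slot names.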


In [None]:
booking_date_predictions_ddf = rf_model.transform(test_booking_date_ddf)

#### Select relevant columns for evaluation of the model

In [None]:
predictions_transformer = SQLTransformer(statement='SELECT user_id, label, prediction, conversion_probability FROM __THIS__')

In [None]:
predictions_pipeline = Pipeline(stages=[predictions_transformer])

In [None]:
predictions_model = predictions_pipeline.fit(booking_date_predictions_ddf)

In [None]:
predictions_ddf = predictions_model.transform(booking_date_predictions_ddf)

#### write predictions to S3 as parquet in order to evaluate using the Scala notebook

In [None]:
# Write to the predictions location defined at the top of the notebook.
predictions_ddf.coalesce(1).write.format("parquet").mode("overwrite").save('%s/%s/%s' % (directory, model_dir, predictions_pipeline_name))
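
Before handing the parquet output to the evaluation notebook, a quick sanity check on a small sample of `(label, prediction)` pairs can catch obvious problems. The sketch below is a simplified pure-Python illustration, assuming binary 0/1 labels; `binary_metrics` and the sample pairs are hypothetical and do not replace the full evaluation:

```python
def binary_metrics(pairs):
    """Compute precision and recall from (label, prediction) pairs of 0/1 values."""
    pairs = list(pairs)
    tp = sum(1 for label, pred in pairs if label == 1 and pred == 1)
    fp = sum(1 for label, pred in pairs if label == 0 and pred == 1)
    fn = sum(1 for label, pred in pairs if label == 1 and pred == 0)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {'precision': precision, 'recall': recall}


# Hypothetical sample of (label, prediction) pairs, for illustration only.
sample = [(1, 1), (1, 0), (0, 1), (0, 0), (1, 1)]
print(binary_metrics(sample))
```

On a small collected sample, for example rows taken with `predictions_ddf.select('label', 'prediction').take(1000)`, the pairs can be passed in as `(row['label'], row['prediction'])`.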