In [4]:
# Import packages
library(tidyverse)
library(bnlearn)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.3
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   1.0.0     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


# Make predictions for the next season

Read in a fitted Bayesian Network (BN) R object, and then use it to make predictions.

We want to make predictions for the following network nodes, depending on the season (given in the node name suffix):
* Early summer: TP_ES, chla_ES, cyano_ES
* Late summer: TP_LS, chla_LS, cyano_LS

The network nodes used as predictors will vary with season; see the use_nodes variable below.

## Set up

In [5]:
# Read in fitted Bayesian network R object
rfile_fpath = "../data/RData/Vansjo_fitted_seasonal_BN_1981-2017.rds"
fitted_BN = readRDS(rfile_fpath) # Read in fitted BN

# fitted_BN

In [6]:
# Read in and format data to use in making predictions

driving_data_fpath = "../data/DataMatrices/BN_dataForPrediction.csv"
data_for_prediction = read.csv(file=driving_data_fpath, header=TRUE, sep=",", row.names = 1)

# Use training data to set correct format for driving data
data_discretized_all = read.csv(file="../data/DataMatrices/Vansjo_Seasonal_Discretized_RegTree_all.csv",
                                header=TRUE, sep=",", row.names = 1)

# Convert from factors to ordered factors. Every column uses a subset of the five
# levels below; droplevels() then removes the unused ones, so each column ends up
# with levels that depend on how many classes it has:
#     2 levels: L, H
#     3 levels: L, M, H
#     4 levels: VL, L, M, H
#     5 levels: VL, L, M, H, VH
data_discretized_all[] = mutate_all(data_discretized_all,
                                    ~ droplevels(factor(., ordered = TRUE, levels = c("VL", "L", "M", "H", "VH"))))

# Drop any columns which don't match the columns in the data for prediction
training_data = data_discretized_all[ , (names(data_discretized_all) %in% colnames(data_for_prediction))]

driving_data = training_data[0,] # New empty dataframe with right ordinal cols
driving_data[1, ] = data_for_prediction[1, ] # Populate dataframe with data for deriving prediction
driving_data

| | chla_prevSummer | cyano_prevSummer | rainy_days_winter | TP_prevSummer | windDays_under_Q0.4_LS | windDays_over_Q0.6_LS |
|---|---|---|---|---|---|---|
| | `<ord>` | `<ord>` | `<ord>` | `<ord>` | `<ord>` | `<ord>` |
| 2019 | H | H | H | H | H | L |
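
The ordered-factor conversion used in the cell above can be checked on a small example. The sketch below uses base R only and a made-up data frame (`toy` and `to_ordered` are hypothetical names, not part of the notebook); it applies the same "full level set, then `droplevels()`" idea and prints the resulting levels per column.

```r
# Full set of class labels, ordered from lowest to highest
all_levels = c("VL", "L", "M", "H", "VH")

# Hypothetical helper: ordered factor that keeps only the levels actually present
to_ordered = function(x) droplevels(factor(x, levels = all_levels, ordered = TRUE))

# Made-up stand-in for the discretized training data
toy = data.frame(two   = c("H", "L", "H", "L"),
                 three = c("M", "L", "H", "M"),
                 stringsAsFactors = FALSE)
toy[] = lapply(toy, to_ordered)

print(levels(toy$two))    # "L" "H"
print(levels(toy$three))  # "L" "M" "H"
```

Because the unused levels are dropped after the ordering is set, a two-class column ends up as L < H and a three-class column as L < M < H, matching the mapping described in the comment.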


In [7]:
# Nodes to use in making predictions, according to season

use_nodes = nodes(fitted_BN) # Default is to use all network nodes. May need amending (see the to-do notes below)

**To do:**

(in Python?)
Create the list of nodes to use when making predictions.
For now, just use all of them. In future, we will want to:

* Drop the variable currently being predicted (during looping) from the use_nodes list, and then
* If the season is early summer, also drop any nodes with the suffix "_LS" (i.e. those that apply to late summer) from use_nodes

We can then uncomment the line within the 'predict' call below that has "from=use_nodes".
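
The two filtering steps in this to-do could also be done directly in R. The following is a minimal sketch, not the notebook's implementation: `select_use_nodes` is a hypothetical helper, and `example_nodes` is an illustrative subset of node names (the real list would come from nodes(fitted_BN)).

```r
# Hypothetical helper: choose predictor nodes for one target node in one season
select_use_nodes = function(all_nodes, target, season) {
  # Step 1: never use the node being predicted as its own predictor
  use_nodes = setdiff(all_nodes, target)
  # Step 2: in early summer, late-summer ("_LS") nodes are not yet known
  if (season == "early summer") {
    use_nodes = use_nodes[!grepl("_LS$", use_nodes)]
  }
  use_nodes
}

# Illustrative node names only
example_nodes = c("TP_prevSummer", "chla_prevSummer", "TP_ES",
                  "chla_ES", "chla_LS", "windDays_under_Q0.4_LS")

es_nodes = select_use_nodes(example_nodes, target = "chla_ES", season = "early summer")
ls_nodes = select_use_nodes(example_nodes, target = "chla_LS", season = "late summer")
print(es_nodes)  # "TP_prevSummer" "chla_prevSummer" "TP_ES"
print(ls_nodes)
```

Assuming the predictors passed to 'predict' must also be available as columns in the driving data, an additional `intersect()` with `names(driving_data)` would be a natural extra step.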

## Make prediction

At the moment, this code makes a prediction for a single variable only. To do:

* Loop through nodes we want predictions for, which depends on the season:

    {'early summer': ['TP_ES', 'chla_ES', 'cyano_ES'],
     'late summer':  ['TP_LS', 'chla_LS', 'cyano_LS']}

* For each, predict and save probabilities of being in the different classes, as well as the overall classification
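
One possible shape for this loop is sketched below. Since the fitted Vansjø network is not available outside the notebook, the example fits a toy network to bnlearn's bundled `learning.test` data (factors A to F) purely as a stand-in; `predict_targets`, `toy_fit` and `evidence` are hypothetical names. For each target node it collects the class probabilities and the overall classification into one data frame.

```r
library(bnlearn)

# Toy network fitted on bnlearn's bundled example data.
# This is a stand-in for fitted_BN, used for illustration only.
data(learning.test)
toy_fit = bn.fit(hc(learning.test), learning.test)

# Hypothetical helper: predict each target node from all other columns of 'evidence'
# (a one-row data frame), returning one row per target and class
predict_targets = function(fitted, evidence, targets, n = 1000) {
  results = lapply(targets, function(target) {
    predictors = setdiff(names(evidence), target)
    pred = predict(fitted, node = target, data = evidence,
                   method = "bayes-lw", from = predictors,
                   prob = TRUE, n = n)
    probs = attr(pred, "prob")[, 1]  # class probabilities for the single evidence row
    data.frame(node = target,
               class = names(probs),
               probability = as.numeric(probs),
               classification = as.character(pred[1]),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, results)
}

set.seed(1)
evidence = learning.test[1, ]
toy_predictions = predict_targets(toy_fit, evidence, targets = c("A", "B"))
print(toy_predictions)
```

In the notebook itself, the call would presumably become something like `predict_targets(fitted_BN, driving_data, targets = c("TP_ES", "chla_ES", "cyano_ES"))`, with the predictor selection replaced by the season-aware use_nodes list once that exists.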

In [9]:
set.seed(1)

predicted_value = predict(fitted_BN,
                          data = driving_data,
                          node = "chla_ES",
                          method = "bayes-lw",
                          # from = use_nodes, # Activate this line once 'use_nodes' is correct (probably easiest to do in Python)
                          prob = TRUE,
                          n = 1000)

# Distribution of probabilities over the classes
probabilities = attr(predicted_value, "prob")

# Predicted class. If probabilities are tied between classes, one of them is selected at random
classification = predicted_value[[1]]

print(probabilities)
print(classification)

   [,1]
L 0.008
H 0.992
[1] H
Levels: L < H
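
To save results like these, the probability matrix can be reshaped into a tidy data frame. The sketch below rebuilds the matrix printed above by hand (rows are classes, the single column is the predicted row) so that it runs on its own; `prob_table` and `most_likely` are hypothetical names.

```r
# Probability matrix as printed above: one row per class, one column per predicted row
probabilities = matrix(c(0.008, 0.992), ncol = 1,
                       dimnames = list(c("L", "H"), NULL))

# One row per class, ready to be appended to a results table or written to CSV
prob_table = data.frame(node = "chla_ES",
                        class = rownames(probabilities),
                        probability = probabilities[, 1],
                        row.names = NULL,
                        stringsAsFactors = FALSE)

# Most probable class; note that which.max() returns the first maximum on a tie,
# rather than choosing at random
most_likely = prob_table$class[which.max(prob_table$probability)]

print(prob_table)
print(most_likely)  # "H"
```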
