# New Hospitals Model

This notebook runs the NHP model and produces the raw results.

Note: the model can take a very long time to run, and loading the resulting data can also be slow. If you run out of RAM (especially when loading data), consider reducing the number of model runs.

In [None]:
params_file = "sample_params.json"
data_path = "data"
results_path = "results"

## Setup

Load the required packages.

In [None]:
import os
import uuid

from datetime import datetime

from run_model import run_model

from model.aae import AaEModel
from model.inpatients import InpatientsModel
from model.outpatients import OutpatientsModel
from model.model_save import LocalSave
from model.helpers import load_params

We need to load the params JSON file.

In [None]:
params = load_params(params_file)
# extract the number of model runs the params call for
model_runs = params["model_runs"]
# set the create_datetime
params["create_datetime"] = f"{datetime.now():%Y%m%d_%H%M%S}"
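
If memory is a concern, one option is to lower the number of model runs before the runner is created. The sketch below uses a stand-in dictionary rather than the real output of `load_params()`, and the cap of 16 (`max_model_runs`) is an arbitrary, hypothetical value.

```python
from datetime import datetime

# stand-in for the dictionary returned by load_params(); only the keys used here
params = {"model_runs": 256}

# hypothetical cap, chosen only for illustration
max_model_runs = 16
params["model_runs"] = min(params["model_runs"], max_model_runs)
model_runs = params["model_runs"]

# same timestamp format as the cell above: YYYYMMDD_HHMMSS
params["create_datetime"] = f"{datetime.now():%Y%m%d_%H%M%S}"
```

Fewer model runs means fewer result files to load, at the cost of a smaller sample of model iterations.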

We will run the model in parallel. By default, we use all available CPU cores. You can set this to a lower value to use fewer resources, but the model will take longer to run.

In [None]:
cpus = os.cpu_count()
cpus
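
To leave some capacity for other work, you could reserve one core. This is a minimal sketch: `os.cpu_count()` can return `None` when the count cannot be determined, so the fallback to 1 is a defensive assumption rather than something the model requires.

```python
import os

# os.cpu_count() may return None, so fall back to a single core
available = os.cpu_count() or 1

# keep one core free, but never go below one worker
cpus = max(1, available - 1)
```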

When we run the model in parallel, it is slightly more efficient to run the model runs in batches. Batches of 4 or 8 seem to be most efficient. This value should be a power of 2.

In [None]:
batch_size = 2 ** 2
batch_size
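
If you change the batch size, a quick check that the value is a power of 2 can catch mistakes early. The helper `is_power_of_two` below is a hypothetical name introduced for illustration; it is not part of the model code.

```python
def is_power_of_two(n: int) -> bool:
    # a positive power of two has exactly one bit set, so n & (n - 1) is zero
    return n > 0 and (n & (n - 1)) == 0

batch_size = 2 ** 3
batch_size_ok = is_power_of_two(batch_size)
```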

When the model runs, it creates a separate file for each model run. We store these in a temporary location before combining them later. The cell below creates a unique path for the model results, which can easily be deleted later.

In [None]:
results_path = os.path.join(results_path, str(uuid.uuid4()))
results_path
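
Because each run gets its own UUID-named folder, cleaning up afterwards is a single directory removal. The sketch below works in a throwaway temporary directory (`base`, a stand-in for the real `results_path`) so that it cannot touch actual results.

```python
import os
import shutil
import tempfile
import uuid

# stand-in for the real results directory
base = tempfile.mkdtemp()

# unique sub-folder for this run, built the same way as in the cell above
run_path = os.path.join(base, str(uuid.uuid4()))
os.makedirs(run_path)

# placeholder file standing in for the per-run model output
with open(os.path.join(run_path, "placeholder.txt"), "w") as f:
    f.write("placeholder")

# delete everything belonging to this run in one call
shutil.rmtree(run_path)
run_removed = not os.path.exists(run_path)

# tidy up the stand-in directory as well
shutil.rmtree(base)
```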

## Run the model

First, we create the model runner. The `run_model()` function expects the params dictionary, the path to the data, the class used to save the results (here `LocalSave`), the path where the results will be saved, which model run to start at, how many model runs to perform, the number of CPU cores to use, and the size of the batches to run.

`run_model()` returns a function that takes one of `AaEModel`, `InpatientsModel`, or `OutpatientsModel`, depending on which type of model we want to run.

Note that we pass `model_runs + 1` as the number of runs to perform: the "principal" model run is model run 0, followed by model runs 1 to `model_runs`.

In [None]:
runner = run_model(
    params,
    data_path,
    LocalSave,
    results_path,
    0,
    model_runs + 1,
    cpus,
    batch_size
)
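
To make the run numbering concrete, the sketch below lists the run indices implied by a start of 0 and a count of `model_runs + 1`, then groups them into batches. The values are hypothetical, and how `run_model()` forms its batches internally is not shown in this notebook, so treat the grouping as an illustration only.

```python
# hypothetical values, for illustration only
model_runs = 10
batch_size = 4

start = 0
count = model_runs + 1

# run 0 is the principal run, followed by runs 1 to model_runs
run_ids = list(range(start, start + count))

# one simple way to group runs into batches of at most batch_size
batches = [run_ids[i:i + batch_size] for i in range(0, len(run_ids), batch_size)]
```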

Now that the runner is set up, we can run each type of model.

In [None]:
runner(AaEModel)

In [None]:
runner(OutpatientsModel)

In [None]:
runner(InpatientsModel)
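
The runner used in this section follows a common pattern: a function that captures the shared settings once and returns a second function to call per model type. The sketch below shows that pattern only. `make_runner` and `DemoModel` are hypothetical stand-ins, and the inner function just records what would be run; it is not the real `run_model()` implementation.

```python
def make_runner(start, count):
    # hypothetical stand-in for run_model(): captures the settings in a closure
    def runner(model_class):
        # record which (model type, run) pairs would be executed
        return [(model_class.__name__, run) for run in range(start, start + count)]
    return runner


class DemoModel:
    """Hypothetical placeholder for a model class such as AaEModel."""


runner = make_runner(0, 3)
jobs = runner(DemoModel)
```

Configuring once and calling per model type keeps the three calls short and guarantees they share the same settings.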

## Load results

We can now load in our results.

In [None]:
import pandas as pd
import numpy as np
import pyarrow.parquet as pq
import pyarrow.dataset as ds

In [None]:
def load_dataset(activity_type):
  p = f"{results_path}/model_results/activity_type={activity_type}/"
  return (ds
    .dataset(p)
    .to_table()
    .to_pandas()
  )

In [None]:
aae = load_dataset("aae").drop(["rn"], axis = "columns")
aae

Inpatient (ip) data needs to be handled slightly differently: we need to split out the outpatient (op) rows and add them back to the op dataset. We also need to join back to the baseline data to get the additional columns.

In [None]:
def split_ip_op_data():
  # outpatient results, without the columns that are not needed here
  op = load_dataset("op").drop(["rn", "is_surgical_specialty", "is_adult", "type"], axis = "columns")
  ip = load_dataset("ip")
  # inpatient rows with classpat "-1" are the rows to move to the op dataset
  ip_op_rows = ip["classpat"] == "-1"

  # count those rows as attendances and fill in the op-only columns
  op_rows = (ip[ip_op_rows]
    .groupby(["age", "sex", "tretspef", "dataset", "scenario", "create_datetime", "model_run"], as_index = False)
    .agg({"rn": len})
    .rename(columns = {"rn": "attendances"})
    .assign(is_gp_ref = False, is_cons_cons_ref = False, is_first = False, has_procedures = True, tele_attendances = 0)
  )

  # add the converted rows to the op data and re-aggregate
  op_fixed = pd.concat([op, op_rows]).groupby([
    "age", "sex", "tretspef", "is_gp_ref", "is_cons_cons_ref", "is_first",
    "has_procedures", "dataset", "scenario", "create_datetime", "model_run"],
    as_index=False
  ).agg({"attendances": "sum", "tele_attendances": "sum"})

  # join the remaining ip rows to the baseline data to get the additional columns
  ip_rows = ip[~ip_op_rows]
  ip_baseline = pq.read_pandas(
    f"{data_path}/{params['input_data']}/ip.parquet",
    ["rn", "imd04_decile", "ethnos", "admidate", "epitype", "dismeth"]
  ).to_pandas()

  ip_fixed = ip_baseline.merge(ip_rows, on="rn")

  return ip_fixed, op_fixed

ip, op = split_ip_op_data()

ip
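
The core of the split can be seen on a toy table. The sketch below uses made-up rows and only a few of the real grouping columns: rows with `classpat` equal to `"-1"` are counted as attendances, and the remaining rows stay in the inpatient data.

```python
import pandas as pd

# made-up inpatient rows, for illustration only
ip_demo = pd.DataFrame({
    "rn": [1, 2, 3, 4],
    "classpat": ["1", "-1", "-1", "2"],
    "tretspef": ["100", "100", "100", "300"],
    "model_run": [0, 0, 0, 0],
})

# flag the rows that should be treated as outpatient activity
is_op = ip_demo["classpat"] == "-1"

# one output row per group, with the number of source rows as attendances
op_rows_demo = (
    ip_demo[is_op]
    .groupby(["tretspef", "model_run"], as_index=False)
    .agg(attendances=("rn", "size"))
    .assign(tele_attendances=0)
)

# everything else remains inpatient activity
ip_rows_demo = ip_demo[~is_op]
```

Note that the merge with the baseline data in `split_ip_op_data()` uses pandas' default inner join, so any row whose `rn` is missing from the baseline file is dropped.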

In [None]:
# Load the change factors. Note that the order of the rows within each model_run partly matters:
# the "baseline" change_factor row must always come first. The remaining rows are in the order the change factors
# were run within the model engine, but they do not strictly need to be shown in that order.
change_factors = (ds.dataset(
    f"{results_path}/change_factors/",
    format = "csv",
    partitioning="hive"
  )
  .to_table()
  .to_pandas()
)
change_factors
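
If the rows are ever reordered, for example after a concatenation, the "baseline first" rule can be restored with two stable sorts. The sketch below uses made-up data and assumes the column is named `change_factor`, as the comment above suggests. The real table's other columns are not shown here.

```python
import pandas as pd

# made-up rows, deliberately out of order in model_run 0
cf_demo = pd.DataFrame({
    "model_run": [0, 0, 0, 1, 1],
    "change_factor": ["a", "baseline", "b", "baseline", "a"],
    "value": [5, 100, -3, 100, 4],
})

cf_sorted = (
    cf_demo
    # False sorts before True, so baseline rows come first
    .assign(_not_baseline=cf_demo["change_factor"] != "baseline")
    .sort_values("_not_baseline", kind="mergesort")
    # a second stable sort groups by model_run without disturbing the order within each run
    .sort_values("model_run", kind="mergesort")
    .drop(columns="_not_baseline")
    .reset_index(drop=True)
)
```

Using a stable sort (`mergesort`) keeps the non-baseline rows in their original relative order within each model run.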