## Tutorial prerequisites

It's recommended you have an empty dataset in GCP to play around with the FDMBuilder tools - so either create one now in preparation, or ask someone with the relevant privileges to make one for you if you can't.

## Quick Jupyter notebooks primer

A Jupyter notebook allows you to write markdown and execute python/R scripts in one document.

This text has been written in a markdown cell (double click right here and you'll be able to edit the markdown). Running a markdown cell renders it, so the formatted text is displayed rather than the raw markdown.

Immediately below this is the first code cell - it contains script that imports all of the required python libraries to run the FDMBuilder. Any output from a code cell will be displayed immediately below the cell.

Be sure to write text and documentation in a markdown cell, and script in a code cell - otherwise you'll get some pretty colourful errors!

There are a bunch of controls to manage each cell in the notebook: the UI has buttons above that can run a code cell, change a code cell to a markdown cell or vice versa, stop execution of a code cell, execute every cell in the notebook, and so on. Hover over each of the buttons above to see what they do. You can also perform all cell-related activities by selecting the `Cell` menu in the toolbar and choosing the relevant option.

However, hotkeys are usually the easiest way to quickly run code cells (and render markdown). Simply select a code cell and:

* press `ctrl+enter` to run the cell 
* press `shift+enter` to run the cell and move focus to the cell below
* press `ctrl+shift+enter` to run the cell and create a new code cell below

That should be enough to get started - plenty of other online guides exist if you want to get better acquainted with the Jupyter notebook environment.

Get started by running the below code cell, which imports all the required python libraries for the FDMBuilder:

In [1]:
from FDMBuilder.FDMTable import *
from FDMBuilder.FDMDataset import *
from FDMBuilder.testing_helpers import *

## FDM Builder - The basics

Note: This guide assumes you're familiar with the term FDM and associated concepts.

The FDMBuilder library has been designed so that a non-python user shouldn't (hopefully) have too much difficulty using the FDM tools to build a dataset from scratch. The workflow is split into two major steps:

1. Prepare the source tables
2. Build the FDM

Each step comes with its own tool or helper that walks through the process of preparing and building an FDM dataset. Source tables are "built" or prepared for the FDM process with the `FDMTable` tool - this is a python "class" that contains all the bits and pieces needed to clean and prep a table for FDMing. Once all the source tables are ready, the FDM dataset itself is "built" using the `FDMDataset` tool - another python class responsible for drawing all the source tables together and building the standard FDM tables (person and observation_period).

We'll begin with the basics of using the FDMTable and FDMDataset tools to build an FDM dataset. Once we're more comfortable with the python workflow, we can then move on to the more "advanced" functions that can streamline many of the more common cleaning/manipulation activities that pop up during the FDM process.

## FDMTable

To begin the FDM process, we need to prep each source table. This process ensures that:

1. The source table is copied to the FDM dataset location
2. person_ids are added to each entry
3. An event_start_date is added to each entry in a cleaned `DATETIME` format
4. If needed an event_end_date is added to each entry in a cleaned `DATETIME` format

To do this using the python FDMBuilder, you first need to define an individual FDMTable object for each of the source tables in your FDM dataset. The below is an example of the python script you would use to initialise an `FDMTable` object:

```
test_table_1 = FDMTable(
    source_table_id="SOURCE_TABLE.LOCATION_GOES_HERE",
    dataset_id="FDM_DATASET_ID_GOES_HERE"
)
```

The above code cell creates a new FDMTable object and stores it as a python variable `test_table_1` - the arguments when creating or initialising an FDMTable are:

* `source_table_id`: the id of the source table (hopefully that wasn't a surprise!). This can be in "project.dataset_id.table_id" form or just "dataset_id.table_id" form
* `dataset_id`: the id of the dataset in which you'll be copying/building your FDM dataset 

    Note: This is hopefully obvious, but you'll need to swap "SOURCE_TABLE.LOCATION_GOES_HERE" and "FDM_DATASET_ID_GOES_HERE" for the actual GCP locations of the source table and the FDM dataset.
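
To make the two accepted forms of `source_table_id` concrete, here is a minimal sketch of how such an id breaks down into its parts. `TableId` and `split_table_id` are hypothetical names made up for this illustration; they are not part of FDMBuilder, and the library may well handle ids differently internally.

```python
from typing import NamedTuple, Optional


class TableId(NamedTuple):
    """Parts of a table id (hypothetical helper, not part of FDMBuilder)."""
    project: Optional[str]
    dataset_id: str
    table_id: str


def split_table_id(full_id: str) -> TableId:
    """Split 'project.dataset_id.table_id' or 'dataset_id.table_id' into parts."""
    parts = full_id.split(".")
    if len(parts) == 3:
        return TableId(parts[0], parts[1], parts[2])
    if len(parts) == 2:
        # No project given (assumption: some default project would apply).
        return TableId(None, parts[0], parts[1])
    raise ValueError(
        f"Expected 2 or 3 dot-separated parts, got {len(parts)}: {full_id!r}"
    )


print(split_table_id("CY_FDM_BUILDER_TESTS.test_table_1"))
# TableId(project=None, dataset_id='CY_FDM_BUILDER_TESTS', table_id='test_table_1')
```

So in the short form, the part before the `.` is the dataset and the part after it is the table.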
    
Let's have a go at initialising an FDMTable ourselves. We'll be using a pre-prepared source dataset that can be found in `CY_FDM_BUILDER_TESTS` and we'll start with `test_table_1`. Take a quick look in the BigQuery SQL workspace to check you have access to the test dataset and that you can find `test_table_1` in there (if not give me - Sam - a shout). Next, you'll need a dataset of your own to store your test FDM tables - create one now if you haven't already. Then, replace the `dataset_id` value in the below code cell (`CY_SAM_TEST` here, standing in for `YOUR_FDM_DATASET_ID_GOES_HERE`) with the actual id of your new dataset, and then run the cell (shift+enter, or the play button above):

In [2]:
test_table_1 = FDMTable(
    source_table_id="CY_FDM_BUILDER_TESTS.test_table_1",
    dataset_id="CY_SAM_TEST"
)

Don't worry if that was a little anti-climactic - initialising an FDMTable doesn't actually do anything in GCP. For that you need to call one of the FDMTable's "methods". Methods are functions attached to a specific class, that update/manipulate/otherwise mess about with the related class. So, the FDMTable class has methods that do things like add columns to the associated table, delete columns, rename columns etc. etc.
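
If classes and methods are new to you, the toy example below may help. `ShoppingList` is made up purely to illustrate the terms and has nothing to do with FDMBuilder: creating the object only stores some details, and nothing interesting happens until a method is called on it.

```python
class ShoppingList:
    """A toy class, unrelated to FDMBuilder, used only to illustrate the terms."""

    def __init__(self, owner):
        # Initialising only stores details on the object; nothing else happens yet.
        self.owner = owner
        self.items = []

    def add_item(self, name):
        # A method: a function attached to the class that changes the object.
        self.items.append(name)
        return len(self.items)


my_list = ShoppingList(owner="Sam")  # create (initialise) an object
my_list.add_item("coffee")           # call a method using object.method()
print(my_list.items)
# ['coffee']
```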

To start, we'll look at the most important of these methods, `build` - fortunately it's also the easiest to get to grips with. Methods are called by specifying the class object, followed by a `.` and then the name of the method. So we call the `build` method on the FDMTable we just defined by running:

```
test_table_1.build()
```

The `build` method is designed to walk the user through the process of preparing an FDM table, stopping each time user input is required. Each time the script stops, it will give a short explanation why and will ask for input with a bit of guidance on the input required.

Give it a try! Run the below cell to build your first FDM table - simply read what the build script says and enter the required input when asked:

In [3]:
test_table_1.build()

	 ##### BUILDING FDM TABLE COMPONENTS FOR test_table_1 #####
________________________________________________________________________________

1. Copying test_table_1 to CY_SAM_TEST:



    A copy of test_table_1 already exists in CY_SAM_TEST. 
    You can continue with the existing test_table_1 table in CY_SAM_TEST
    or make a fresh copy from the source dataset."   
    
    Continue with existing copy?
    > Type y or n:  y



    Continuing with existing copy of test_table_1

2. Adding person_id column:
    test_table_1 already contains person_id column

3. Adding event_start_date column:



    event_start_date column is already present.
    
    You can continue with the existing event_start_date column or rebuild
    a new event_start_date_column from scratch.
    
    Continue with existing event_start_date? 
    > Type y or n:  y



4. Adding event_end_date column:



    An event end date may or may not be relevant to this source data. For example, 
    hospital visits or academic school years have an end date as well as a start 
    date.

    If you're unsure weather or not the source data should include an event end 
    date, seek help from the CYP data team."

    Does this data have an event end date?"
    > Type y or n:  n


________________________________________________________________________________

	 ##### BUILD PROCESS FOR test_table_1 COMPLETE! #####



Hopefully that went without a hitch! If not give me a shout...

If you take a look over at GCP and give your tab a quick refresh, you should notice that your test dataset now contains a copy of `test_table_1` and the table has a shiny new `event_start_date` column. You'll also notice the parser has taken a string with text and digits and has converted it into a SQL datetime - this should hopefully save a lot of manual faff in the long run!
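
To give a feel for what "parsing" a date string involves, here is a small standalone sketch using python's standard `datetime` module. It is not how FDMBuilder does it internally (that isn't shown in this notebook), and the formats listed are assumed examples rather than the formats found in the test data.

```python
from datetime import datetime

# Assumed example formats; the real source data may look different.
CANDIDATE_FORMATS = ["%d-%b-%Y", "%d %B %Y", "%Y-%m-%d"]


def parse_date(text):
    """Try each candidate format in turn; return None if none of them match."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None


print(parse_date("03-Feb-2021"))
print(parse_date("not a date"))
```

Trying formats one at a time and returning `None` on failure keeps unparseable values visible rather than silently guessing at them.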

This was a pretty simple example that didn't ask much of the FDMBuilder - the second example throws a couple more curveballs into the mix, but hopefully the build script should still guide you through. When you're ready, run the below code cell:

(as before, be sure to replace `YOUR_FDM_DATASET_ID_GOES_HERE` with your own test dataset) 

In [4]:
test_table_2 = FDMTable(
    source_table_id="CY_FDM_BUILDER_TESTS.test_table_2",
    dataset_id="CY_SAM_TEST"
)
    
test_table_2.build()

	 ##### BUILDING FDM TABLE COMPONENTS FOR test_table_2 #####
________________________________________________________________________________

1. Copying test_table_2 to CY_SAM_TEST:



    A copy of test_table_2 already exists in CY_SAM_TEST. 
    You can continue with the existing test_table_2 table in CY_SAM_TEST
    or make a fresh copy from the source dataset."   
    
    Continue with existing copy?
    > Type y or n:  y



    Continuing with existing copy of test_table_2

2. Adding person_id column:
    test_table_2 already contains person_id column

3. Adding event_start_date column:



    event_start_date column is already present.
    
    You can continue with the existing event_start_date column or rebuild
    a new event_start_date_column from scratch.
    
    Continue with existing event_start_date? 
    > Type y or n:  y



4. Adding event_end_date column:



    event_end_date column is already present.

    You can continue with the existing event_end_date column or rebuild a new 
    event_end_date_column from scratch.

    Continue with existing event_end_date?
    > Type y or n:  y


________________________________________________________________________________

	 ##### BUILD PROCESS FOR test_table_2 COMPLETE! #####



Hopefully you made it through that without too much issue. Like before, if you hop over to GCP and refresh your SQL workspace, you should see a `test_table_2` in your dataset. This time the FDMTable tool has done a little more work - it renamed the `wrong_digest` column, added `person_id`s from the digest column, and has parsed the `event_start_date`s and `event_end_date`s. Fab!

That's the basics of the table prep stage done. Now we can move on to actually building the FDM Dataset.

## FDMDataset

You'll probably note that, thus far, the FDMTable tool doesn't seem to have done anything all that dramatic. It's just copied a couple of tables into our FDM dataset, made sure the person_id is in good order and added a couple of dates. But it's important that these boxes are ticked off properly before we try to build the rest of the FDM dataset - building the `person` and `observation_period` tables, and removing any problematic entries in our source data.

The job of building the FDM dataset is given to the, aptly named, `FDMDataset` class. It works in a very similar way to the `FDMTable` class - you initialise it with some simple details, and then run a `.build()` method to have it work its magic. Unlike the `FDMTable` however, the `FDMDataset` can work said magic without the need for any user input. All that's required is:

1. A dataset for your FDM
2. Source tables that have been built using the `FDMTable` class/tool
3. No tables in the dataset other than FDM source tables (or tables that the `FDMDataset` class has built itself - but more on that later)
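
In practice, the second requirement above means each source table carries the columns that the `FDMTable` step adds: `person_id` and `event_start_date`, plus `event_end_date` where relevant. The sketch below shows that kind of check on plain lists of column names. `check_table_ready`, `unprepared_table` and the example columns `some_value`, `digest` and `some_date` are hypothetical and used only for illustration; this is not the check `FDMDataset` itself runs.

```python
# Columns every prepared source table is expected to have.
# event_end_date is optional, so it is not checked here.
REQUIRED_COLUMNS = ("person_id", "event_start_date")


def check_table_ready(column_names):
    """Return (ready, missing) for a list of column names."""
    columns = set(column_names)
    missing = [col for col in REQUIRED_COLUMNS if col not in columns]
    return (not missing, missing)


# Made-up column lists, for illustration only.
tables = {
    "test_table_1": ["person_id", "event_start_date", "some_value"],
    "unprepared_table": ["digest", "some_date"],
}
for name, cols in tables.items():
    ready, missing = check_table_ready(cols)
    status = "ready" if ready else f"missing {', '.join(missing)}"
    print(f"{name}: {status}")
# test_table_1: ready
# unprepared_table: missing person_id, event_start_date
```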

The dataset you've built for this tutorial *should* tick all those boxes - provided you haven't deviated off the path this notebook has been walking. If so, you're ready to build your dataset. First, initialise your `FDMDataset` instance by running the following cell (and, as before, replacing `YOUR_FDM_DATASET_ID_GOES_HERE` with the actual id of your test dataset):

In [5]:
test_dataset = FDMDataset(
    dataset_id="CY_SAM_TEST"
)

and then, as with the `FDMTable`, you just run the `.build()` method, and it takes care of the rest:

In [6]:
test_dataset.build()

		 ##### BUILDING FDM DATASET CY_SAM_TEST #####
________________________________________________________________________________

1. Checking dataset for source tables:

    * test_table_1 contains:  - person_id - event_start_date 
	-> Table ready
    * test_table_2 contains:  - person_id - event_start_date  * event_end_date
	-> Table ready
2. Building person table

    * Person table built with 186 entries

3. Separating out problem entries from source tables

    test_table_1:
	* 45 problem entries identified and removed to test_table_1_problems
	* 55 entries remain in test_table_1
    test_table_2:
	* 37 problem entries identified and removed to test_table_2_problems
	* 63 entries remain in test_table_2

4. Rebuilding person table

    * Person table built with 118 entries

5. Building observation_period table

    * observation_period table built with 118 entries

________________________________________________________________________________

	 ##### BUILD PROCESS FOR CY_SAM_TEST